Crash on loading model from different TensorRT engines #410

Open
@filipecosta90

According to NVIDIA support, TensorRT engines are not compatible across different TensorRT versions, which leads to the following errors being logged:

2020-06-16 13:43:36.634318: E tensorflow/compiler/tf2tensorrt/utils/trt_logger.cc:41] DefaultLogger ../rtSafe/coreReadArchive.cpp (31) - Serialization Error in verifyHeader: 0 (Magic tag does not match)
2020-06-16 13:43:36.634451: E tensorflow/compiler/tf2tensorrt/utils/trt_logger.cc:41] DefaultLogger INVALID_STATE: std::exception
2020-06-16 13:43:36.634476: E tensorflow/compiler/tf2tensorrt/utils/trt_logger.cc:41] DefaultLogger INVALID_CONFIG: Deserialize the cuda engine failed.

However, RedisAI crashes inside libtensorflow.so because it accesses an invalid memory location (the crash report below shows signal 11 while accessing address nil). We need to safeguard against this.
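As a rough illustration of the kind of safeguard meant here, the sketch below checks the result of a deserialization step and turns a failure into a reportable error instead of continuing with a missing engine. It is a toy model only: `deserialize_engine`, `load_engine_or_fail`, `EngineLoadError`, and the byte tags are hypothetical names invented for this example, not RedisAI, TensorFlow, or TensorRT APIs, and this report does not establish whether the real check belongs in RedisAI or in TensorFlow's TensorRT integration.

```python
from typing import Optional


class EngineLoadError(RuntimeError):
    """Raised when a serialized engine cannot be deserialized."""


def deserialize_engine(blob: bytes, expected_tag: bytes) -> Optional[dict]:
    """Hypothetical stand-in for a runtime deserializer.

    Returns None when the header tag does not match, mimicking a
    deserializer that reports failure by returning a null engine.
    """
    if not blob.startswith(expected_tag):
        return None
    return {"payload": blob[len(expected_tag):]}


def load_engine_or_fail(blob: bytes, expected_tag: bytes) -> dict:
    """Guarded load: never hand a missing engine to the caller."""
    engine = deserialize_engine(blob, expected_tag)
    if engine is None:
        # Surface an error the server can return to the client
        # instead of dereferencing a null engine later on.
        raise EngineLoadError(
            "engine deserialization failed; the engine may have been "
            "built with a different TensorRT version"
        )
    return engine
```

With a guard of this shape, a mismatched engine would produce an error reply to the client rather than a segmentation fault in the server process.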

TensorRT versions available on the machine:

$ dpkg -l | grep nvinfer
ii  libnvinfer-dev                            6.0.1-1+cuda10.1                      amd64        TensorRT development libraries and headers
ii  libnvinfer-plugin6                        6.0.1-1+cuda10.1                      amd64        TensorRT plugin libraries
ii  libnvinfer-plugin7                        7.0.0-1+cuda10.2                      amd64        TensorRT plugin libraries
ii  libnvinfer5                               5.1.5-1+cuda10.1                      amd64        TensorRT runtime libraries
ii  libnvinfer6                               6.0.1-1+cuda10.1                      amd64        TensorRT runtime libraries
ii  libnvinfer7                               7.0.0-1+cuda10.2                      amd64        TensorRT runtime libraries
ii  python3-libnvinfer                        7.0.0-1+cuda10.2                      amd64        Python 3 bindings for TensorRT
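The listing above shows three TensorRT runtime packages installed side by side. A minimal sketch for detecting that situation from `dpkg -l` output follows; the function name `tensorrt_runtime_versions` is an assumption made for this example, and it only recognizes rows shaped like the ones shown above (status `ii`, package name `libnvinfer<N>`).

```python
import re


def tensorrt_runtime_versions(dpkg_output: str) -> dict:
    """Map each installed libnvinfer<N> runtime package to its version."""
    versions = {}
    for line in dpkg_output.splitlines():
        fields = line.split()
        # dpkg -l rows: status, package, version, architecture, description...
        if (
            len(fields) >= 3
            and fields[0] == "ii"
            and re.fullmatch(r"libnvinfer\d+", fields[1])
        ):
            versions[fields[1]] = fields[2]
    return versions


if __name__ == "__main__":
    import subprocess

    # Assumes a Debian/Ubuntu host where dpkg is available.
    output = subprocess.run(
        ["dpkg", "-l"], capture_output=True, text=True, check=True
    ).stdout
    found = tensorrt_runtime_versions(output)
    if len(found) > 1:
        print("Multiple TensorRT runtimes installed:", found)
```

If more than one runtime is reported, it is worth confirming which `libnvinfer.so.<N>` TensorFlow actually opens and which version was used to build the engine.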

Here is the crash report:

$ taskset -c 0-2 redis-server  --port 6379  --protected-mode no --save  --appendonly no --loadmodule install-gpu/redisai.so
16404:C 16 Jun 2020 13:43:24.673 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
16404:C 16 Jun 2020 13:43:24.673 # Redis version=6.0.5, bits=64, commit=51efb7fe, modified=0, pid=16404, just started
16404:C 16 Jun 2020 13:43:24.673 # Configuration loaded
16404:M 16 Jun 2020 13:43:24.674 # You requested maxclients of 10000 requiring at least 10032 max file descriptors.
16404:M 16 Jun 2020 13:43:24.674 # Server can't set maximum open files to 10032 because of OS error: Operation not permitted.
16404:M 16 Jun 2020 13:43:24.674 # Current maximum open files is 8192. maxclients has been reduced to 8160 to compensate for low ulimit. If you need higher maxclients increase 'ulimit -n'.
                _._                                                  
           _.-``__ ''-._                                             
      _.-``    `.  `_.  ''-._           Redis 6.0.5 (51efb7fe/0) 64 bit
  .-`` .-```.  ```\/    _.,_ ''-._                                   
 (    '      ,       .-`  | `,    )     Running in standalone mode
 |`-._`-...-` __...-.``-._|'` _.-'|     Port: 6379
 |    `-._   `._    /     _.-'    |     PID: 16404
  `-._    `-._  `-./  _.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |           http://redis.io        
  `-._    `-._`-.__.-'_.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |                                  
  `-._    `-._`-.__.-'_.-'    _.-'                                   
      `-._    `-.__.-'    _.-'                                       
          `-._        _.-'                                           
              `-.__.-'                                               

16404:M 16 Jun 2020 13:43:24.674 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
16404:M 16 Jun 2020 13:43:24.674 # Server initialized
16404:M 16 Jun 2020 13:43:24.674 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
16404:M 16 Jun 2020 13:43:24.674 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
16404:M 16 Jun 2020 13:43:24.674 * <ai> Redis version found by RedisAI: 6.0.5 - oss
16404:M 16 Jun 2020 13:43:24.674 * <ai> RedisAI version 999999, git_sha=a5d15247c28b0c3e8d97351d418f593cac7c7d41
16404:M 16 Jun 2020 13:43:24.675 * Module 'ai' loaded from install-gpu/redisai.so
16404:M 16 Jun 2020 13:43:24.675 * Ready to accept connections
16404:M 16 Jun 2020 13:43:32.243 # <ai> backend TF not loaded, will try loading default backend
16404:M 16 Jun 2020 13:43:32.316 * <ai> TF backend loaded from install-gpu/backends/redisai_tensorflow/redisai_tensorflow.so
2020-06-16 13:43:32.391652: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300060000 Hz
2020-06-16 13:43:32.391911: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55ba47abd5c0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-06-16 13:43:32.391945: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-06-16 13:43:32.394074: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-06-16 13:43:32.414912: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-16 13:43:32.415869: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:1e.0
2020-06-16 13:43:32.416144: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-06-16 13:43:32.418113: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-06-16 13:43:32.419909: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-06-16 13:43:32.420249: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-06-16 13:43:32.422368: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-06-16 13:43:32.423570: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-06-16 13:43:32.428018: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-06-16 13:43:32.428106: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-16 13:43:32.429060: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-16 13:43:32.429992: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-06-16 13:43:32.430031: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-06-16 13:43:32.588536: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-06-16 13:43:32.588581: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0 
2020-06-16 13:43:32.588597: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N 
2020-06-16 13:43:32.588731: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-16 13:43:32.589721: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-16 13:43:32.590647: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-16 13:43:32.591532: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15052 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)
2020-06-16 13:43:36.335299: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.6
2020-06-16 13:43:36.634318: E tensorflow/compiler/tf2tensorrt/utils/trt_logger.cc:41] DefaultLogger ../rtSafe/coreReadArchive.cpp (31) - Serialization Error in verifyHeader: 0 (Magic tag does not match)
2020-06-16 13:43:36.634451: E tensorflow/compiler/tf2tensorrt/utils/trt_logger.cc:41] DefaultLogger INVALID_STATE: std::exception
2020-06-16 13:43:36.634476: E tensorflow/compiler/tf2tensorrt/utils/trt_logger.cc:41] DefaultLogger INVALID_CONFIG: Deserialize the cuda engine failed.


=== REDIS BUG REPORT START: Cut & paste starting from here ===
16404:M 16 Jun 2020 13:43:36.634 # Redis 6.0.5 crashed by signal: 11
16404:M 16 Jun 2020 13:43:36.634 # Crashed running the instruction at: 0x7f3210b99afd
16404:M 16 Jun 2020 13:43:36.634 # Accessing address: (nil)
16404:M 16 Jun 2020 13:43:36.634 # Failed assertion: <no assertion failed> (<no file>:0)

------ STACK TRACE ------
EIP:
/home/ubuntu/RedisAI/install-gpu/backends/redisai_tensorflow/lib/libtensorflow.so.1(+0x77e5afd)[0x7f3210b99afd]

Backtrace:
redis-server *:6379(logStackTrace+0x5a)[0x55ba45e60b7a]
redis-server *:6379(sigsegvHandler+0xb1)[0x55ba45e61331]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7f322e9f2890]
/home/ubuntu/RedisAI/install-gpu/backends/redisai_tensorflow/lib/libtensorflow.so.1(+0x77e5afd)[0x7f3210b99afd]
/home/ubuntu/RedisAI/install-gpu/backends/redisai_tensorflow/lib/libtensorflow.so.1(+0x77ebb19)[0x7f3210b9fb19]
/home/ubuntu/RedisAI/install-gpu/backends/redisai_tensorflow/lib/libtensorflow_framework.so.1(_ZN10tensorflow13BaseGPUDevice12ComputeAsyncEPNS_13AsyncOpKernelEPNS_15OpKernelContextESt8functionIFvvEE+0xf7)[0x7f32085b5727]
/home/ubuntu/RedisAI/install-gpu/backends/redisai_tensorflow/lib/libtensorflow_framework.so.1(+0xfe678f)[0x7f320861278f]
/home/ubuntu/RedisAI/install-gpu/backends/redisai_tensorflow/lib/libtensorflow_framework.so.1(+0xfe7ecf)[0x7f3208613ecf]
/home/ubuntu/RedisAI/install-gpu/backends/redisai_tensorflow/lib/libtensorflow_framework.so.1(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x291)[0x7f32086ba3f1]
/home/ubuntu/RedisAI/install-gpu/backends/redisai_tensorflow/lib/libtensorflow_framework.so.1(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x48)[0x7f32086b7a68]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd6df)[0x7f32073606df]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7f322e9e76db]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7f322e71088f]

------ INFO OUTPUT ------
# Server
redis_version:6.0.5
redis_git_sha1:51efb7fe
redis_git_dirty:0
redis_build_id:c91f8bc14d762a3e
redis_mode:standalone
os:Linux 5.3.0-1023-aws x86_64
arch_bits:64
multiplexing_api:epoll
atomicvar_api:atomic-builtin
gcc_version:7.5.0
process_id:16404
run_id:d47b395987887757cb872ccdcf2993f58a6a3638
tcp_port:6379
uptime_in_seconds:12
uptime_in_days:0
hz:10
configured_hz:10
lru_clock:15256712
executable:/home/ubuntu/RedisAI/redis-server
config_file:

# Clients
connected_clients:2
client_recent_max_input_buffer:34277758
client_recent_max_output_buffer:0
blocked_clients:1
tracking_clients:0
clients_in_timeout_table:0

# Memory
used_memory:44116752
used_memory_human:42.07M
used_memory_rss:1318522880
used_memory_rss_human:1.23G
used_memory_peak:84765920
used_memory_peak_human:80.84M
used_memory_peak_perc:52.05%
used_memory_overhead:827724
used_memory_startup:793680
used_memory_dataset:43289028
used_memory_dataset_perc:99.92%
allocator_allocated:44580704
allocator_active:44851200
allocator_resident:133017600
total_system_memory:64277659648
total_system_memory_human:59.86G
used_memory_lua:37888
used_memory_lua_human:37.00K
used_memory_scripts:0
used_memory_scripts_human:0B
number_of_cached_scripts:0
maxmemory:0
maxmemory_human:0B
maxmemory_policy:noeviction
allocator_frag_ratio:1.01
allocator_frag_bytes:270496
allocator_rss_ratio:2.97
allocator_rss_bytes:88166400
rss_overhead_ratio:9.91
rss_overhead_bytes:1185505280
mem_fragmentation_ratio:29.89
mem_fragmentation_bytes:1274407664
mem_not_counted_for_evict:0
mem_replication_backlog:0
mem_clients_slaves:0
mem_clients_normal:33972
mem_aof_buffer:0
mem_allocator:jemalloc-5.1.0
active_defrag_running:0
lazyfree_pending_objects:0

# Persistence
loading:0
rdb_changes_since_last_save:1
rdb_bgsave_in_progress:0
rdb_last_save_time:1592315004
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:-1
rdb_current_bgsave_time_sec:-1
rdb_last_cow_size:0
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_last_cow_size:0
module_fork_in_progress:0
module_fork_last_cow_size:0

# Stats
total_connections_received:3
total_commands_processed:11
instantaneous_ops_per_sec:1
total_net_input_bytes:34880428
total_net_output_bytes:28944
instantaneous_input_kbps:367.72
instantaneous_output_kbps:4.45
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:0
expired_stale_perc:0.00
expired_time_cap_reached_count:0
expire_cycle_cpu_milliseconds:0
evicted_keys:0
keyspace_hits:1
keyspace_misses:0
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:0
migrate_cached_sockets:0
slave_expires_tracked_keys:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0
tracking_total_keys:0
tracking_total_items:0
tracking_total_prefixes:0
unexpected_error_replies:0

# Replication
role:master
connected_slaves:0
master_replid:728851d0b17b5a0417012fcd4d4edb105befc9cb
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:0
second_repl_offset:-1
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0

# CPU
used_cpu_sys:0.629231
used_cpu_user:1.262905
used_cpu_sys_children:0.000000
used_cpu_user_children:0.000000

# Modules
module:name=ai,ver=999999,api=1,filters=0,usedby=[],using=[],options=[]

# Commandstats
cmdstat_ai.modelset:calls=1,usec=370516,usec_per_call=370516.00
cmdstat_ai.dagrun:calls=1,usec=41,usec_per_call=41.00
cmdstat_info:calls=9,usec=584,usec_per_call=64.89

# Cluster
cluster_enabled:0

# Keyspace
db0:keys=1,expires=0,avg_ttl=0

------ CLIENT LIST OUTPUT ------
id=6 addr=127.0.0.1:51484 fd=8 name= age=7 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=info user=default
id=8 addr=10.3.0.28:35030 fd=9 name= age=1 idle=1 flags=b db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ai.dagrun user=default

------ REGISTERS ------
16404:M 16 Jun 2020 13:43:36.638 # 
RAX:0000000000000000 RBX:00007f31d59686e0
RCX:0000000000000007 RDX:0000000000000000
RDI:0000000000000000 RSI:00007f3154000a08
RBP:00007f31d5968670 RSP:00007f31d59683b0
R8 :00007f30f34f7ab8 R9 :0000000000000008
R10:00000000ffffffbf R11:0000000000000000
R12:00007f30e83c31d0 R13:0000000000000000
R14:00007f310c9bc990 R15:00007f3138e57fc0
RIP:00007f3210b99afd EFL:0000000000010206
CSGSFS:002b000000000033
16404:M 16 Jun 2020 13:43:36.638 # (00007f31d59683bf) -> 0000000000000000
16404:M 16 Jun 2020 13:43:36.638 # (00007f31d59683be) -> 0000000000000030
16404:M 16 Jun 2020 13:43:36.638 # (00007f31d59683bd) -> 0000000000000007
16404:M 16 Jun 2020 13:43:36.638 # (00007f31d59683bc) -> 0000000300000000
16404:M 16 Jun 2020 13:43:36.638 # (00007f31d59683bb) -> 00007f31d59686c0
16404:M 16 Jun 2020 13:43:36.638 # (00007f31d59683ba) -> 00007f3138e581f0
16404:M 16 Jun 2020 13:43:36.638 # (00007f31d59683b9) -> 0000000000000001
16404:M 16 Jun 2020 13:43:36.638 # (00007f31d59683b8) -> 00007f310c9bc9a8
16404:M 16 Jun 2020 13:43:36.638 # (00007f31d59683b7) -> 0000000000000000
16404:M 16 Jun 2020 13:43:36.638 # (00007f31d59683b6) -> 0000000000000000
16404:M 16 Jun 2020 13:43:36.638 # (00007f31d59683b5) -> 0000000108368629
16404:M 16 Jun 2020 13:43:36.638 # (00007f31d59683b4) -> 00007f31d59685d0
16404:M 16 Jun 2020 13:43:36.638 # (00007f31d59683b3) -> 00000001d59693d0
16404:M 16 Jun 2020 13:43:36.638 # (00007f31d59683b2) -> 00007f31d59685f0
16404:M 16 Jun 2020 13:43:36.638 # (00007f31d59683b1) -> 00007f31d5968c48
16404:M 16 Jun 2020 13:43:36.638 # (00007f31d59683b0) -> 00007f3154000cb0

------ MODULES INFO OUTPUT ------

------ FAST MEMORY TEST ------
(...)
.O.O.O.O.O.O.O.O.O.O.O.O.O.O.O.O.O.O.O.O.O.O.O.O.O.O.O.O.O.O.O.Segmentation fault (core dumped)
