|
| 1 | +# Release r1.15.5-deeprec2304 |
| 2 | + |
| 3 | +## **Major Features and Improvements** |
| 4 | + |
| 5 | +### **Embedding** |
| 6 | + |
| 7 | +- Support tf.int32 dtype in the feature_column API `tf.feature_column.categorical_column_with_embedding`. |
| 8 | +- Make the rules of export frequencies and versions the same as the rule of export keys. |
| 9 | +- Optimize CUDA kernel implementation in GroupEmbedding. |
| 10 | +- Support reading embedding files with mmap and madvise, and with direct IO. |
| 11 | +- Add double check in find_wait_free of lockless dense hashmap. |
| 12 | +- Change the initial value of version in EV from 0 to -1. |
| 13 | +- Keep the 'GetSnapshot()' interface backward compatible. |
| 14 | +- Implement CPU GroupEmbedding lookup sparse Op. |
| 15 | +- Make GroupEmbedding compatible with sequence feature_column interface. |
| 16 | +- Fix sp_weights indices calculation error in GroupEmbedding. |
| 17 | +- Add group_strategy to control parallelism of group_embedding. |
| 18 | + |
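The mmap and madvise item above can be illustrated with a minimal Python sketch. It is not DeepRec's implementation (which lives in the C++ runtime); the helper names `read_with_mmap` and `demo` are hypothetical, and the choice of `MADV_SEQUENTIAL` is an assumption made only for illustration.

```python
import mmap
import os
import tempfile


def read_with_mmap(path, offset, length):
    """Return `length` bytes starting at `offset`, read through a read-only memory map."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # madvise is only exposed on platforms that provide it (Python 3.8+),
            # so guard the call instead of assuming it exists.
            if hasattr(mm, "madvise") and hasattr(mmap, "MADV_SEQUENTIAL"):
                mm.madvise(mmap.MADV_SEQUENTIAL)  # hint: pages will be read in order
            return mm[offset:offset + length]


def demo():
    """Write a small temporary file and read a slice of it back via mmap."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"0123456789abcdef")
        path = tmp.name
    try:
        return read_with_mmap(path, 4, 6)
    finally:
        os.remove(path)
```

Mapping the file lets the OS page cache serve repeated reads without extra copies, while the madvise hint only tunes read-ahead and does not change the bytes returned.
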
| 19 | +### **Graph & Grappler Optimization** |
| 20 | + |
| 21 | +- Support SparseTensor as placeholder in Sample-awared Graph Compression. |
| 22 | +- Add Dice fusion grappler and ops. |
| 23 | +- Enable MKL Matmul + Bias + LeakyRelu fusion. |
| 24 | + |
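For readers unfamiliar with Dice, the following is a plain-Python sketch of the activation, assuming "Dice" here refers to the data-adaptive activation from the Deep Interest Network paper. The function name `dice`, the default `alpha`, and `eps` are illustrative choices, not DeepRec's fused op or its parameters.

```python
import math


def dice(batch, alpha=0.25, eps=1e-8):
    """Apply a Dice-style activation to a list of scalars, using batch statistics."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((s - mean) ** 2 for s in batch) / n
    out = []
    for s in batch:
        # p is a sigmoid of the standardized input; it gates between s and alpha * s.
        p = 1.0 / (1.0 + math.exp(-(s - mean) / math.sqrt(var + eps)))
        out.append(p * s + (1.0 - p) * alpha * s)
    return out
```

Expressed as separate graph nodes this takes several elementwise ops (mean, variance, sigmoid, multiply, add), which is the kind of subgraph a grappler fusion pass can collapse into a single kernel.
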
| 25 | +### **Runtime Optimization** |
| 26 | + |
| 27 | +- Avoid unnecessary polling in EventMgr. |
| 28 | +- Reduce lock cost and memory usage in EventMgr when using multi-stream. |
| 29 | + |
| 30 | +### **Ops & Hardware Acceleration** |
| 31 | + |
| 32 | +- Register GPU implementation of int64 type for Prod. |
| 33 | +- Register GPU implementation of string type for Shape, ShapeN and ExpandDims. |
| 34 | +- Optimize list of GPU SegmentReductionOps. |
| 35 | +- Optimize zeros_like_impl by reducing calls to convert_to_tensor. |
| 36 | +- Implement GPU version of SparseSlice Op. |
| 37 | +- Delay Reshape when rank > 2 in keras.layers.Dense so that post op can be fused with MatMul. |
| 38 | +- Implement setting max_num_threads hint to oneDNN at compile time. |
| 39 | +- Implement TensorPackTransH2DOp to improve SmartStage performance on GPU. |
| 40 | + |
| 41 | +### **IO** |
| 42 | + |
| 43 | +- Add tensor shape meta-data support for ParquetDataset. |
| 44 | +- Add arrow BINARY type support for ParquetDataset. |
| 45 | + |
| 46 | +### **Serving** |
| 47 | + |
| 48 | +- Add Dice fusion to inference mode. |
| 49 | +- Enable INFERENCE_MODE in processor. |
| 50 | +- Support TensorRT 8.x in Inference. |
| 51 | +- Add a configuration field to control whether TensorRT is enabled. |
| 52 | +- Add flag for device_placement_optimization. |
| 53 | +- Avoid clustering feature column related nodes when TensorRT is enabled. |
| 54 | +- Optimize inference latency when loading an incremental checkpoint. |
| 55 | +- Optimize performance by placing only TensorRT ops on the GPU device. |
| 56 | + |
| 57 | +### **Environment & Build** |
| 58 | + |
| 59 | +- Support CUDA 12. |
| 60 | +- Update DEFAULT_CUDA_VERSION and DEFAULT_CUDNN_VERSION in configure.py. |
| 61 | +- Move thirdparties from WORKSPACE to workspace.bzl. |
| 62 | +- Update urls corresponding to colm, ragel, aliyun-oss-sdk and uuid. |
| 63 | + |
| 64 | +### **BugFix** |
| 65 | + |
| 66 | +- Fix constant op placing bug for device placement optimization. |
| 67 | +- Fix NaN issue that occurred in the group_embedding API. |
| 68 | +- Fix the issue that SOK is not compatible with variable. |
| 69 | +- Fix memory leak when updating the full model in serving. |
| 70 | +- Fix the issue that 'cols_to_output_tensors' is not set in GroupEmbedding. |
| 71 | +- Fix core dump issue about saving GPU EmbeddingVariable. |
| 72 | +- Fix CUDA resource issue in KvResourceImportV3 kernel. |
| 73 | +- Fix loading signature_def with coo_sparse bug and add UT. |
| 74 | +- Fix the bug that the training ends early when the workqueue is enabled. |
| 75 | +- Fix the control edge connection issue in device placement optimization. |
| 76 | + |
| 77 | +### **ModelZoo** |
| 78 | + |
| 79 | +- Modify GroupEmbedding related function usage. |
| 80 | +- Update masknet example with layernorm. |
| 81 | + |
| 82 | +### **Tool & Documents** |
| 83 | + |
| 84 | +- Add a tool for removing filtered features from checkpoints. |
| 85 | +- Add Arm Compute Library (ACL) user documents. |
| 86 | +- Update Embedding Variable document to fix initializer config example. |
| 87 | +- Update GroupEmbedding document. |
| 88 | +- Update processor documents. |
| 89 | +- Add user documents for Intel AMX. |
| 90 | +- Add TensorRT usage documents. |
| 91 | +- Update documents for ParquetDataset. |
| 92 | + |
| 93 | +More details of features: [https://deeprec.readthedocs.io/zh/latest/](https://deeprec.readthedocs.io/zh/latest/) |
| 94 | + |
| 95 | +## **Release Images** |
| 96 | + |
| 97 | +### **CPU Image** |
| 98 | + |
| 99 | +`alideeprec/deeprec-release:deeprec2304-cpu-py38-ubuntu20.04` |
| 100 | + |
| 101 | +### **GPU Image** |
| 102 | + |
| 103 | +`alideeprec/deeprec-release:deeprec2304-gpu-py38-cu116-ubuntu20.04` |
| 104 | + |
1 | 105 | # Release r1.15.5-deeprec2302
|
2 | 106 |
|
3 | 107 | ## **Major Features and Improvements**
|