
Commit 3ad88aa

[Train] Test Train code snippets (#40432)
As a byproduct of the recent documentation rewrites, the Train docs contain several code snippets that aren't tested. This PR updates the snippets so that the ones that can reasonably be tested are tested.

Signed-off-by: Balaji Veeramani <[email protected]>
1 parent 837ec26 commit 3ad88aa

16 files changed (+228, -112 lines)
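
The recurring change across the diffs below converts untested ``.. code-block:: python`` directives in the Train docs into Sphinx ``testcode`` directives, adding ``:skipif: True`` where a snippet can't run in CI (GPU-only or deliberately incomplete code) and ``:hide:`` setup blocks where a snippet needs variables defined first. A representative before/after sketch; the exact options vary per snippet:

    Before:

        .. code-block:: python

            trainer.fit()

    After:

        .. testcode::
            :skipif: True

            trainer.fit()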

doc/BUILD

Lines changed: 29 additions & 9 deletions
@@ -312,8 +312,8 @@ doctest(
         "source/serve/production-guide/fault-tolerance.md",
         "source/data/batch_inference.rst",
         "source/data/transforming-data.rst",
-        "source/train/faq.rst",
-        "source/train/user-guides/data-loading-preprocessing.rst",
+        "source/train/**/*.rst",
+        "source/train/**/*.md",
         "source/workflows/**/*.rst",
         "source/workflows/**/*.md",
         "source/rllib/**/*.rst",
@@ -326,6 +326,33 @@ doctest(
 )


+doctest(
+    name="doctest[train]",
+    files = glob(
+        include=[
+            "source/train/**/*.rst",
+            "source/train/**/*.md"
+        ],
+        exclude=[
+            # GPU
+            "source/train/user-guides/data-loading-preprocessing.rst",
+            "source/train/user-guides/using-gpus.rst"
+        ]
+    ),
+    tags = ["team:ml"]
+)
+
+doctest(
+    name="doctest[train]",
+    files = [
+        "source/train/user-guides/data-loading-preprocessing.rst",
+        "source/train/user-guides/using-gpus.rst"
+    ],
+    tags = ["team:ml"],
+    gpu = True,
+)
+
+
 doctest(
     name="doctest[workflow]",
     files = glob(
@@ -362,10 +389,3 @@ doctest(
     gpu = True
 )

-doctest(
-    name="quarantine",
-    files = [
-        "source/train/user-guides/data-loading-preprocessing.rst",
-    ],
-    tags = ["team:data"],
-)

doc/source/train/deepspeed.rst

Lines changed: 2 additions & 1 deletion
@@ -10,7 +10,8 @@ Code example

 You only need to run your existing training code with a TorchTrainer. You can expect the final code to look like this:

-.. code-block:: python
+.. testcode::
+    :skipif: True

     import deepspeed
     from deepspeed.accelerator import get_accelerator

doc/source/train/distributed-tensorflow-keras.rst

Lines changed: 17 additions & 8 deletions
@@ -46,7 +46,8 @@ variable set up for you.
 The `MultiWorkerMirroredStrategy <https://www.tensorflow.org/api_docs/python/tf/distribute/experimental/MultiWorkerMirroredStrategy>`_
 enables synchronous distributed training. You *must* build and compile the ``Model`` within the scope of the strategy.

-.. code-block:: python
+.. testcode::
+    :skipif: True

     with tf.distribute.MultiWorkerMirroredStrategy().scope():
         model = ... # build model
@@ -81,7 +82,12 @@ execute training. For distributed Tensorflow,
 use a :class:`~ray.train.tensorflow.TensorflowTrainer`
 that you can setup like this:

-.. code-block:: python
+.. testcode::
+    :hide:
+
+    train_func = lambda: None
+
+.. testcode::

     from ray.train import ScalingConfig
     from ray.train.tensorflow import TensorflowTrainer
@@ -95,7 +101,8 @@ that you can setup like this:
 To customize the backend setup, you can pass a
 :class:`~ray.train.tensorflow.TensorflowConfig`:

-.. code-block:: python
+.. testcode::
+    :skipif: True

     from ray.train import ScalingConfig
     from ray.train.tensorflow import TensorflowTrainer, TensorflowConfig
@@ -116,7 +123,8 @@ Run a training function
 With a distributed training function and a Ray Train ``Trainer``, you are now
 ready to start training.

-.. code-block:: python
+.. testcode::
+    :skipif: True

     trainer.fit()

@@ -138,7 +146,7 @@ API for model training.
 `See this example <https://github.com/ray-project/ray/blob/master/python/ray/train/examples/tf/tune_tensorflow_autoencoder_example.py>`__
 for distributed data loading. The relevant parts are:

-.. code-block:: python
+.. testcode::

     import tensorflow as tf
     from ray import train
@@ -188,7 +196,7 @@ local log files. The logging also triggers :ref:`checkpoint bookkeeping <train-d
 The easiest way to report your results with Keras is by using the
 :class:`~ray.train.tensorflow.keras.ReportCheckpointCallback`:

-.. code-block:: python
+.. testcode::

     from ray.train.tensorflow.keras import ReportCheckpointCallback

@@ -223,8 +231,9 @@ attribute.
 These concrete examples demonstrate how Ray Train appropriately saves checkpoints, model weights but not models, in distributed training.


-.. code-block:: python
+.. testcode::

+    import json
     import os
     import tempfile

@@ -275,7 +284,7 @@ directory <train-log-dir>` of each run.
 Load checkpoints
 ~~~~~~~~~~~~~~~~

-.. code-block:: python
+.. testcode::

     import os
     import tempfile
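
The ReportCheckpointCallback hunk above shows only the import line. For context, a minimal sketch of how that callback typically attaches to a Keras training loop inside a Ray Train training function; the model and data here are placeholders rather than code from this commit:

    import numpy as np
    import tensorflow as tf
    from ray.train.tensorflow.keras import ReportCheckpointCallback

    def train_func():
        # Placeholder one-layer model; the real docs build the model inside
        # MultiWorkerMirroredStrategy's scope.
        model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
        model.compile(optimizer="sgd", loss="mse")
        X = np.random.rand(32, 1)
        y = np.random.rand(32, 1)
        # The callback reports metrics and checkpoints back to Ray Train each epoch.
        model.fit(X, y, epochs=2, callbacks=[ReportCheckpointCallback()])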

doc/source/train/examples/pytorch/torch_data_prefetch_benchmark/benchmark_example.rst

Lines changed: 2 additions & 1 deletion
@@ -8,7 +8,8 @@ Torch Data Prefetching Benchmark for Ray Train
 We provide a benchmark example to show how the auto pipeline for host to device data transfer speeds up training on GPUs.
 This functionality can be easily enabled by setting ``auto_transfer=True`` in :func:`train.torch.prepare_data_loader`.

-.. code-block:: python
+.. testcode::
+    :skipif: True

     from torch.utils.data import DataLoader
     from ray import train
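
For reference, the auto pipeline described above is a one-argument change when preparing the data loader. A minimal sketch with a placeholder DataLoader (the real benchmark script builds its own):

    import torch
    from torch.utils.data import DataLoader, TensorDataset
    from ray.train.torch import prepare_data_loader

    def train_func():
        dataset = TensorDataset(torch.rand(64, 8), torch.rand(64, 1))
        data_loader = DataLoader(dataset, batch_size=16)
        # auto_transfer=True overlaps host-to-device copies with compute on GPU workers.
        data_loader = prepare_data_loader(data_loader, auto_transfer=True)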

doc/source/train/getting-started-pytorch-lightning.rst

Lines changed: 29 additions & 10 deletions
@@ -17,14 +17,15 @@ Quickstart

 For reference, the final code is as follows:

-.. code-block:: python
+.. testcode::
+    :skipif: True

     from ray.train.torch import TorchTrainer
     from ray.train import ScalingConfig

     def train_func(config):
         # Your PyTorch Lightning training code here.
-
+
     scaling_config = ScalingConfig(num_workers=2, use_gpu=True)
     trainer = TorchTrainer(train_func, scaling_config=scaling_config)
     result = trainer.fit()
@@ -39,7 +40,10 @@ Compare a PyTorch Lightning training script with and without Ray Train.

     .. group-tab:: PyTorch Lightning

-        .. code-block:: python
+        .. This snippet isn't tested because it doesn't use any Ray code.
+
+        .. testcode::
+            :skipif: True

             import torch
             from torchvision.models import resnet18
@@ -154,7 +158,8 @@ Set up a training function
 First, update your training code to support distributed training.
 Begin by wrapping your code in a :ref:`training function <train-overview-training-function>`:

-.. code-block:: python
+.. testcode::
+    :skipif: True

     def train_func(config):
         # Your PyTorch Lightning training code here.
@@ -324,7 +329,7 @@ Outside of your training function, create a :class:`~ray.train.ScalingConfig` ob
 1. `num_workers` - The number of distributed training worker processes.
 2. `use_gpu` - Whether each worker should use a GPU (or CPU).

-.. code-block:: python
+.. testcode::

     from ray.train import ScalingConfig
     scaling_config = ScalingConfig(num_workers=2, use_gpu=True)
@@ -338,7 +343,15 @@ Launch a training job
 Tying this all together, you can now launch a distributed training job
 with a :class:`~ray.train.torch.TorchTrainer`.

-.. code-block:: python
+.. testcode::
+    :hide:
+
+    from ray.train import ScalingConfig
+
+    train_func = lambda: None
+    scaling_config = ScalingConfig(num_workers=1)
+
+.. testcode::

     from ray.train.torch import TorchTrainer

@@ -353,7 +366,7 @@ Access training results
 After training completes, Ray Train returns a :class:`~ray.train.Result` object, which contains
 information about the training run, including the metrics and checkpoints reported during training.

-.. code-block:: python
+.. testcode::

     result.metrics # The metrics reported during training.
     result.checkpoint # The latest checkpoint reported during training.
@@ -407,9 +420,11 @@ control over their native Lightning code.

     .. group-tab:: (Deprecating) LightningTrainer

+        .. This snippet isn't tested because it raises a hard deprecation warning.
+
+        .. testcode::
+            :skipif: True

-        .. code-block:: python
-
             from ray.train.lightning import LightningConfigBuilder, LightningTrainer

             config_builder = LightningConfigBuilder()
@@ -449,9 +464,13 @@ control over their native Lightning code.

     .. group-tab:: (New API) TorchTrainer

-        .. code-block:: python
+        .. This snippet isn't tested because it runs with 4 GPUs, and CI is only run with 1.
+
+        .. testcode::
+            :skipif: True

             import lightning.pytorch as pl
+            from ray.air import CheckpointConfig, RunConfig
             from ray.train.torch import TorchTrainer
             from ray.train.lightning import (
                 RayDDPStrategy,
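
Putting the quickstart pieces from this file together, a minimal sketch of the launch pattern the page describes; the training function body is a placeholder, and the worker count is reduced so the sketch can run on a single CPU machine:

    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    def train_func(config):
        # Your PyTorch Lightning training code here.
        pass

    scaling_config = ScalingConfig(num_workers=1, use_gpu=False)
    trainer = TorchTrainer(train_func, scaling_config=scaling_config)
    result = trainer.fit()
    print(result.metrics)     # The metrics reported during training.
    print(result.checkpoint)  # The latest checkpoint reported during training.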

doc/source/train/getting-started-pytorch.rst

Lines changed: 22 additions & 8 deletions
@@ -18,7 +18,8 @@ Quickstart

 For reference, the final code is as follows:

-.. code-block:: python
+.. testcode::
+    :skipif: True

     from ray.train.torch import TorchTrainer
     from ray.train import ScalingConfig
@@ -40,7 +41,10 @@ Compare a PyTorch training script with and without Ray Train.

     .. group-tab:: PyTorch

-        .. code-block:: python
+        .. This snippet isn't tested because it doesn't use any Ray code.
+
+        .. testcode::
+            :skipif: True

             import tempfile
             import torch
@@ -138,7 +142,8 @@ Set up a training function
 First, update your training code to support distributed training.
 Begin by wrapping your code in a :ref:`training function <train-overview-training-function>`:

-.. code-block:: python
+.. testcode::
+    :skipif: True

     def train_func(config):
         # Your PyTorch training code here.
@@ -212,8 +217,9 @@ See :ref:`data-ingest-torch`.
 Keep in mind that ``DataLoader`` takes in a ``batch_size`` which is the batch size for each worker.
 The global batch size can be calculated from the worker batch size (and vice-versa) with the following equation:

-.. code-block:: python
-
+.. testcode::
+    :skipif: True
+
     global_batch_size = worker_batch_size * ray.train.get_context().get_world_size()


@@ -248,7 +254,7 @@ Outside of your training function, create a :class:`~ray.train.ScalingConfig` ob
 1. :class:`num_workers <ray.train.ScalingConfig>` - The number of distributed training worker processes.
 2. :class:`use_gpu <ray.train.ScalingConfig>` - Whether each worker should use a GPU (or CPU).

-.. code-block:: python
+.. testcode::

     from ray.train import ScalingConfig
     scaling_config = ScalingConfig(num_workers=2, use_gpu=True)
@@ -262,7 +268,15 @@ Launch a training job
 Tying this all together, you can now launch a distributed training job
 with a :class:`~ray.train.torch.TorchTrainer`.

-.. code-block:: python
+.. testcode::
+    :hide:
+
+    from ray.train import ScalingConfig
+
+    train_func = lambda: None
+    scaling_config = ScalingConfig(num_workers=1)
+
+.. testcode::

     from ray.train.torch import TorchTrainer

@@ -275,7 +289,7 @@ Access training results
 After training completes, a :class:`~ray.train.Result` object is returned which contains
 information about the training run, including the metrics and checkpoints reported during training.

-.. code-block:: python
+.. testcode::

     result.metrics # The metrics reported during training.
     result.checkpoint # The latest checkpoint reported during training.
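
The global batch size relation shown in this file can be sanity-checked with plain arithmetic. A sketch with hypothetical values; inside a training function, the worker count would come from ray.train.get_context().get_world_size():

    worker_batch_size = 16
    num_workers = 2  # hypothetical; use get_world_size() inside a Ray Train worker
    global_batch_size = worker_batch_size * num_workers
    assert global_batch_size == 32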
