Dataloading Revamp #3216

AntonioMacaronio · 2024-06-12T11:21:55Z

Problems and Background

With a sufficiently large dataset, the current parallel_datamanager.py tries to cache the entire dataset in RAM, which leads to an out-of-memory (OOM) error.
parallel_datamanager.py uses only one worker to generate ray bundles. Steps such as unprojection during ray generation or pixel sampling within a custom mask can be CPU-intensive, so they are better parallelized. Although parallel_datamanager.py can be run with multiple workers, each worker caches the entire dataset in RAM, so memory holds duplicate copies of the dataset and massive datasets are not supported. It also implements parallelism from scratch, which makes it hard to build on.
Additionally, both VanillaDataManager and ParallelDataManager rely on CacheDataloader, which subclasses torch.utils.data.DataLoader. This is an unusual coding practice and serves no particular purpose in the current nerfstudio implementation.
Similarly for full_images_datamanager.py: the current implementation loads the entire dataset into the FullImageDataloader's cached_train attribute, which fails when the dataset cannot fit in RAM. To avoid this efficiently, we need multiprocess parallelization to load and undistort images quickly enough to keep up with the GPU's forward and backward passes of the model.
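
To make the memory problem concrete, here is a rough back-of-the-envelope estimate. It is only a sketch: it assumes uint8 RGB images at 3840x2160 (one common meaning of "4k"; the exact resolution is not stated here), and the helper names are hypothetical. Float32 tensors would be 4x larger.

```python
GIB = 1024 ** 3


def decoded_image_bytes(height: int, width: int, channels: int = 3, bytes_per_value: int = 1) -> int:
    """Size of one decoded image held as a dense array (uint8 by default)."""
    return height * width * channels * bytes_per_value


def dataset_cache_bytes(num_images: int, height: int, width: int, num_workers: int = 1) -> int:
    """RAM needed if every worker caches the full decoded dataset."""
    return num_images * decoded_image_bytes(height, width) * num_workers


# Assumed example: 2000 frames at 3840x2160, cached once and by four workers.
one_copy = dataset_cache_bytes(2000, 2160, 3840)
four_copies = dataset_cache_bytes(2000, 2160, 3840, num_workers=4)
print(f"one copy:    {one_copy / GIB:.1f} GiB")
print(f"four copies: {four_copies / GIB:.1f} GiB")
```

Under these assumptions a single decoded copy is already about 46 GiB, and each additional worker that caches the whole dataset adds the same amount again.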

Overview of Changes

Replacing CacheDataloader with RayBatchStream, which subclasses torch.utils.data.IterableDataset. The goal of this class is to generate ray bundles directly without caching all images in RAM. It does this by collating a sampled batch of images and sampling rays from it. A new ParallelDatamanager class is written, which is available side by side with the original VanillaDatamanager but can completely replace it.
Adding an ImageBatchStream to create a parallel, OOM-resistant version of FullImageDataManager. It can be configured to load from disk by setting the cache_images config variable to disk.
A new pil_to_numpy() function is added. This function reads a PIL.Image's data buffer and fills an empty numpy array while reading, which speeds up the conversion and removes an extra memory allocation. It is the fastest way to get from a PIL Image to a PyTorch tensor, averaging ~2.5ms for a 1080x1920 image (~40% faster).
A new flag called cache_compressed_imgs caches your images in RAM in their compressed form (for example, caching) and relies on parallelized CPU dataloading to decode them efficiently into PyTorch tensors for training.
Resolving some pyright issues: within nerfstudio/scripts/exporter.py, the assertions for ExportPointCloud and ExportPoissonMesh were modified because these exporters are only used on NeRFs, so the checks for splats (which have their own export method) and RandomCameraDatamanger (outdated) were removed. Similarly, some "# type: ignore" comments were added at various runtime-checked locations that pyright could not verify, in base_pipeline.py and trainer.py.
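
The idea behind cache_compressed_imgs can be illustrated with a small stdlib-only sketch. This is not the PR's implementation: zlib stands in for an image codec such as JPEG, and the class and method names are hypothetical. The cache holds only compressed bytes, and decoding happens on access, which is the work the dataloader workers would do in parallel.

```python
import zlib
from typing import Dict


class CompressedImageCache:
    """Keeps images in RAM as compressed bytes and decodes them on demand.

    Hypothetical sketch: zlib stands in for a real image codec (e.g. JPEG).
    """

    def __init__(self) -> None:
        self._blobs: Dict[int, bytes] = {}

    def put(self, index: int, raw_pixels: bytes) -> None:
        # Store only the compressed representation.
        self._blobs[index] = zlib.compress(raw_pixels)

    def get(self, index: int) -> bytes:
        # Decode on every access; nothing decoded is kept around.
        return zlib.decompress(self._blobs[index])

    def cached_bytes(self) -> int:
        # Total RAM used by the compressed blobs.
        return sum(len(blob) for blob in self._blobs.values())


# A flat-colour 64x64 RGB "image" compresses well, so the cache stays small.
raw = bytes([128, 64, 32]) * (64 * 64)
cache = CompressedImageCache()
cache.put(0, raw)
print(len(raw), "raw bytes vs", cache.cached_bytes(), "cached bytes")
```

The trade-off is CPU time for RAM: every access pays a decode, which is why this approach depends on having several workers decoding concurrently.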

Impact

Check out these comparisons! The left was trained on 200 images of a 4k video, while the right was trained on 2000 images of the same 4k video.

pwais

nice progress! sorry it's not fast but i think i know why:

i think the main reason this is slower than expected is because _get_collated_batch() gets called per raybundle and sadly _get_collated_batch() is AFAIK needlessly slow.

take note about how the current CacheDataloader avoids doing _get_collated_batch() per raybundle. it would have been nice for the author to have left some notes about how slow _get_collated_batch() is, but evidently that author found it necessary not to collate images per raybundle.
in my impl, I just call _get_collated_batch() once on a small set of images and keep that batch cached. the main problem I saw is that _get_collated_batch() on thousands of images seemed to use 2x or 3x as much RAM as actually needed and thus caused many minutes of swapping and stuff

Even if you only call _get_collated_batch() once tho, you might need a bigger prefetch factor and/or more workers depending on the model.

IMO it's worth trying to find a way to get the result of nerfstudio_collate on cameras (I think the cameras do need to be collated because they can be ragged? i could be wrong and they don't need collation) but on images just have the worker read image files / buffers and never call collate on those tensors.

Just to be clear, this is the line where collate on images can go nuts and start taking forever to allocate 200GB or more of RAM for many images in code in main:

nerfstudio/nerfstudio/data/utils/nerfstudio_collate.py

Line 101 in 3d27dd4

storage = elem.storage()._new_shared(numel, device=elem.device)

So! If a worker is just emitting raybundles then the images never need to be in shared tensor memory then eh? Thus should be able to save some RAM and CPU by skipping that line for images. Still need to think about the cost of reading the images themselves, but collate is definitely a troublemaker.
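
The suggestion above (load a small set of images once, keep it cached, and draw many ray batches from it) can be sketched without any nerfstudio code. This is a hedged illustration, not the PR's RayBatchStream: `load_image`, `images_per_cache`, `batches_per_cache`, and `pixels_per_batch` are hypothetical names, and images are plain lists of pixel values.

```python
import random
from typing import Callable, Iterator, List, Sequence, Tuple

Pixel = Tuple[int, int]  # (image index, pixel offset within that image)


def stream_pixel_batches(
    image_indices: Sequence[int],
    load_image: Callable[[int], List[float]],
    images_per_cache: int,
    batches_per_cache: int,
    pixels_per_batch: int,
    seed: int = 0,
) -> Iterator[List[Tuple[Pixel, float]]]:
    """Yield pixel batches forever, reloading a small image subset only occasionally."""
    rng = random.Random(seed)
    while True:
        # Expensive step: load (and, in a real pipeline, collate) a few images.
        subset = rng.sample(list(image_indices), images_per_cache)
        cached = {i: load_image(i) for i in subset}
        # Cheap step: reuse the cached subset for several batches.
        for _ in range(batches_per_cache):
            batch = []
            for _ in range(pixels_per_batch):
                i = rng.choice(subset)
                p = rng.randrange(len(cached[i]))
                batch.append(((i, p), cached[i][p]))
            yield batch


# Usage: count how often the (fake) loader is actually called.
loads: List[int] = []


def fake_load(i: int) -> List[float]:
    loads.append(i)
    return [float(i)] * 16


stream = stream_pixel_batches(
    range(100), fake_load, images_per_cache=4, batches_per_cache=5, pixels_per_batch=8
)
batches = [next(stream) for _ in range(10)]
print(len(loads))  # 8: two cache refreshes of 4 images each for 10 batches
```

Raising `batches_per_cache` amortizes the loading cost over more batches, at the price of each batch seeing a less varied set of images.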

nerfstudio/data/datamanagers/base_datamanager.py

pwais

just took a quick look (can't do a full review right now), so cool to see this coming along!!

Sounds like this change will target the case where the uncompressed image tensors can't fit in RAM, but the raw image files (typically jpeg) do fit in RAM. In that case I guess we do want each worker to literally load the file bytes into Python RAM (as implemented) rather than rely on the OS disk cache, because the idea is that the uncompressed image tensors will otherwise blow out the disk cache.

I think it would be important to test, in the end, a case where the user has only limited RAM (say 16GB) and e.g. an 8GB laptop graphics card. In that case I think there are moderate or larger image datasets where the whole thing would OOM when using the current cache impl. It would then be helpful to have some way to disable the cache, or just to communicate to the user that their machine is simply too weak for the dataset (e.g. just a CONSOLE.print("[bold yellow]Warning ...") in the line where the workers start reading image files into RAM).
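
A check along the lines suggested above might look like the following. It is only a sketch: the function name, the `headroom` fraction, and the message text are assumptions, and plain `print` stands in for the `CONSOLE.print` call mentioned in the comment. The caller supplies the available RAM, since the Python standard library has no portable way to query it.

```python
from typing import Optional, Sequence


def cache_warning(
    file_sizes: Sequence[int], available_ram_bytes: int, headroom: float = 0.5
) -> Optional[str]:
    """Return a warning if caching the compressed files would use more than
    `headroom` of the available RAM, otherwise None."""
    needed = sum(file_sizes)
    budget = int(available_ram_bytes * headroom)
    if needed <= budget:
        return None
    return (
        f"Warning: caching {needed / 1e9:.1f} GB of image files exceeds the "
        f"{budget / 1e9:.1f} GB budget; consider disabling the RAM cache."
    )


# Assumed example: 3000 files of 4 MB each on a machine with 16 GB of RAM.
message = cache_warning([4_000_000] * 3000, available_ram_bytes=16_000_000_000)
if message is not None:
    print(message)
```

In practice the file sizes could come from `os.path.getsize` over the dataset's image paths before any worker starts reading files.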

nerfstudio/data/datamanagers/base_datamanager.py

nerfstudio/data/utils/data_utils.py

nerfstudio/data/utils/dataloaders.py

kerrj · 2025-01-07T21:34:26Z

@jb-ye would be good to get some more eyes on this, it's very close to ready to merge, we've tested with external methods, train/test/render/export for nerfacto/splatfacto and we're hoping to merge in the next couple of days

nerfstudio/data/datamanagers/base_datamanager.py

nerfstudio/data/datasets/base_dataset.py

chungmin99 · 2025-01-08T22:25:54Z

nerfstudio/pipelines/base_pipeline.py

@@ -297,6 +296,9 @@ def get_train_loss_dict(self, step: int):
            step: current iteration step to update sampler if using DDP (distributed)
        """
        ray_bundle, batch = self.datamanager.next_train(step)
+        ray_bundle = ray_bundle.to(
+            self.device
+        )  # for some reason this line of code is needed otherwise viewmats will not be invertible?


Is this comment related to this: pytorch/pytorch#90613?

AntonioMacaronio · 2025-01-09

So this line is addressing a super weird bug I’ve been getting when there are more than 0 workers on my machine. It shows up in specific instances that I’ve had trouble replicating, but it comes up every now and then, and I apologize for not bringing this up earlier!

The bug: for example, when running splatfacto, within datamanager.next_train() of full_images_datamanager.py, all tensor fields of ray_bundle are correct. For instance, I can do print(ray_bundle.camera_to_worlds) and these poses are correct. However, within base_pipeline.py, when I add a print statement for the c2w matrices, these tensors have become 0's, which leads to an error further down the stack in splatfacto.py.

I’m going to hold off on merging until I can get this sorted out.

On my machine, this issue is addressed by keeping the ray_bundle or camera TensorDataclass object on the CPU until the object is used in base_pipeline.py. Because all models such as SplatfactoModel or NerfactoModel assume that the ray_bundle or camera is already on the GPU, I keep this ray_bundle variable on the CPU until right before it is passed into the model, to prevent its fields from getting set to 0.

I'm not sure why this happens. It could be a PyTorch version issue or something specific to my machine, but I know some others did not experience this.

chungmin99 · 2025-01-09

I see -- I also couldn't replicate the bug from commenting out the line, with the latest up-to-date code.

BTW, are you sure that ray_bundle is on CPU? Personally, I found that raybundle is already on cuda even before we hit ray_bundle = ray_bundle.to(device), so the line is a no-op. It also seems that both RayBatchStream and ImageBatchStream move all data to the input device in the __iter__ calls?

jb-ye · 2025-01-09T02:51:37Z

@jb-ye would be good to get some more eyes on this, it's very close to ready to merge, we've tested with external methods, train/test/render/export for nerfacto/splatfacto and we're hoping to merge in the next couple of days

@f-dy might be interested to take a look since it is a major change impacting the data workflow.

pwais · 2025-01-09T23:05:19Z

Very cool to see this moving along!! Congrats @AntonioMacaronio !! What sort of dataset sizes have you tested so far, like 500 4k images for nerfacto / depth-nerfacto as well as splatfacto? I would be curious to test now that it's closer to launch!

AntonioMacaronio and others added 16 commits June 11, 2024 15:03

initial debugging and testing works

0471543

pwais changes with RayBatchStream to alleviate training

c6dde7d

Merge branch 'main' into dataloading-revamp

a09ea0c

few bugs to iron out with multiprocessing, specifically pickled collate_fn

78453cd

working version of RayBatchStream

f2bd96f

additional docstrings

d8b7430

cleanup

a5425d4

much more documentation

604f734

successfully trained AEA-script2_seq2 closed_loop without OOM

0143803

porting over aria dataset-size feature

d3527e2

added logic to handle eviction of a worker's cached_collated_batch

25f5f27

antonio's implementation of stream batches

3a8b63b

training on a dataset with 4000 images works!

536c6ca

some configuration speedups, loops aren't actually needed!

43a0061

quick fix adjustment to aria

fa7cf30

removed unnecessary looping

927cb6a

pwais reviewed Jun 17, 2024

AntonioMacaronio added 11 commits June 25, 2024 07:22

much faster training when adding i variable to collate every 5 ray bundles

814f2c2

cleanup unnecssary variables in Dataloader

247ac3e

further cleanup

55d0803

adding caching of compressed images to RAM to reduce disk bottleneck

b6979a4

added caching to RAM for masks

81dbf7c

found fast way to collate - many tricks applied

55ca71d

quick update to aria to test on different datasets

3b4f091

cleaned up the accelerated pil_to_numpy function

7de1922

cleaning up PR

9ceaad1

this commit was used to generate the time metrics and profiling metrics

4147a6a

REAL commit used to run tests

5a55b7a

pwais reviewed Jul 28, 2024

testing with nerfacto-big

78f02e6

AntonioMacaronio and others added 17 commits January 3, 2025 19:25

adding description comments

3e06221

Merge branch 'main' into dataloading-revamp

c6f8094

updating description

e72dd78

Merge branch 'dataloading-revamp' of https://github.com/AntonioMacaronio/nerfstudio into dataloading-revamp

f5024fc

resolving some pyright issues with export.py, explained in PR desc

d2af513

fixing pyright issues in base_pipeline.py

1a02133

ran pyright on exporter and base_pipeline.py without issues

b7bcb13

adding a git ignore to a clearly checked pyright issue

603a5db

typo

eedda79

merge

3c5ab8e

fixing most ns-dev-test cases

2f90812

Merge branch 'dataloading-revamp' of https://github.com/AntonioMacaronio/nerfstudio into dataloading-revamp

3a82351

cleanup, passing final ns-dev-test

4091694

oops, accidentally pushed the deletion of a docstring, undoing that

e7c99e4

another cleanup

bd6d1ae

some fixes to eval pipeline

deb4d7f

lint

a5f62aa

Merge branch 'main' into dataloading-revamp

97629a7

chungmin99 reviewed Jan 8, 2025

nerfstudio/data/datamanagers/base_datamanager.py (resolved)

kerrj added 3 commits January 8, 2025 14:12

add asserts for spawn

e13525e

Merge branch 'dataloading-revamp' of https://github.com/AntonioMacaronio/nerfstudio into dataloading-revamp

94afc0b

lint

c316a7b

chungmin99 reviewed Jan 8, 2025

nerfstudio/data/datasets/base_dataset.py (resolved)

chungmin99 reviewed Jan 8, 2025

AntonioMacaronio added 3 commits January 9, 2025 03:24

cleaning up import statements in parallel_datamanager.py

b8da37d

Merge branch 'dataloading-revamp' of https://github.com/AntonioMacaronio/nerfstudio into dataloading-revamp

3fafbc2

adding new developer documentation if users would like to migrate their custom datamanagers to support new features

e4a7661
