compatible with torch.utils.data.DataLoader #154
Conversation
Found a solution for the shuffling problem. Not perfect but I can't think of anything better. Test:

```python
import time
import numpy as np
from pix2tex.dataset.dataset import Im2LatexDataset
from pix2tex.utils import *
from tqdm import tqdm
from torch.utils.data import DataLoader
from hashlib import blake2b

dataset = Im2LatexDataset().load("mnt/d/git_repository/LaTeX-OCR/pix2tex/dataset/data/val.pkl")
dataset.update(batchsize=10, test=True, shuffle=True)
dataloader = DataLoader(dataset, batch_size=None, num_workers=4)


def tostr(tokens):
    return post_process(token2str(tokens, dataset.tokenizer)[0])


if __name__ == '__main__':
    start_time = time.time()
    for e in range(0, 2):
        seqhashs = []
        pbar = tqdm(enumerate(iter(dataloader)), total=len(dataloader))
        for i, (seq, im) in pbar:
            seqhashs.extend([blake2b(bytes(tostr(s), encoding='utf-8'),
                                     digest_size=8).hexdigest() for s in seq['input_ids']])
        dataset._shuffle()  # Is necessary now for shuffling. Not perfect.
        print('Only unique samples?', len(seqhashs) == np.unique(seqhashs).shape[0])
        print("=" * 50)
    end_time = time.time()
    # print with 3 decimal places
    print("Time taken: %.3f seconds" % (end_time - start_time))
```
I think that's good enough, maybe write a class to wrap it. One small thing I want to point out in your test code is that ...
Yes, you're right. We can write a custom Dataloader class as well that could handle most things internally.
Thanks for pointing that out. I only wanted to check that I didn't make a mistake while coding and to make sure all samples are being served.
Can't wait to use the new data loader. Do we directly merge it or write a custom ...
I would inherit from the torch DataLoader and overwrite the `__iter__` method.
Like this:

```python
class Dataloader(DataLoader):
    def __init__(self, dataset: Im2LatexDataset, batch_size=1, shuffle=False, *args, **kwargs):
        self.dataset = dataset
        self.dataset.update(batchsize=batch_size, shuffle=shuffle)
        super().__init__(self.dataset, *args, shuffle=False, batch_size=None, **kwargs)

    def __iter__(self):
        self.dataset._shuffle()
        return super().__iter__()
```

That way we can call the data loader similar to the vanilla torch DataLoader:

```python
dataset = Im2LatexDataset().load(path)
dataloader = Dataloader(dataset, batch_size=10, shuffle=True, num_workers=4)
```

Feel free to go from here.
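(As far as I understand the torch API, `batch_size=None` in the `super().__init__` call disables the DataLoader's automatic batching, so every item the workers yield is already a complete batch assembled by Im2LatexDataset; `shuffle` is pinned to False on the parent because shuffling is handled by the dataset itself via `_shuffle()`.)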
Next steps would be to add ... Also check if ...
Got it
Nice work. Only one small comment (the 3 comments are related): I think it would be nicer to separate the dataset and dataloader a little bit, as it is done in PyTorch.
GPU6, new version:

```yaml
gpu_devices: [6] #[0,1,2,3,4,5,6,7]
num_workers: 20
...
```

GPU7, previous version:

```yaml
gpu_devices: [7] #[0,1,2,3,4,5,6,7]
...
```

Same config, training-time comparison.

GPU6 group:

```
BLEU: 0.044, ED: 3.46e+00, ACC: 0.035:   0%|          | 0/301 [00:01<?, ?it/s]
BLEU: 0.176, ED: 1.26e+00, ACC: 0.119:   0%|          | 0/301 [00:01<?, ?it/s]
BLEU: 0.280, ED: 6.88e-01, ACC: 0.208:   0%|          | 0/301 [00:00<?, ?it/s]
BLEU: 0.370, ED: 6.28e-01, ACC: 0.224:   0%|          | 0/301 [00:00<?, ?it/s]
Loss: 0.3471:  60%|█████▉    | 4648/7767 [08:54<04:58, 10.44it/s]
```

GPU7 group:

```
BLEU: 0.017, ED: 7.88e+00, ACC: 0.012:   0%|          | 0/389 [00:08<?, ?it/s]
BLEU: 0.169, ED: 1.08e+00, ACC: 0.059:   0%|          | 0/389 [00:07<?, ?it/s]
Loss: 0.6331:  30%|███       | 2396/7891 [09:34<19:28, 4.70it/s]
```

Thanks for your code review, I will respond after the test is over. I only have the last ...
```python
class Dataloader(DataLoader):
    def __init__(self, dataset: Im2LatexDataset, shuffle=False, *args, **kwargs):
        self.dataset = dataset
        self.tokenizer = dataset.tokenizer
        self.dataset.update(shuffle=shuffle, *args, **kwargs)
        print("Here is the Dataloader class", kwargs.get('pin_memory', False))  # to verify pin_memory works normally
        super().__init__(self.dataset,
                         *args,
                         shuffle=False,
                         batch_size=None,
                         num_workers=kwargs.get('num_workers', 0),
                         pin_memory=kwargs.get('pin_memory', False))
```

Shell output:

```
Here is the Dataloader class True
Here is the Dataloader class True
BLEU: 0.044, ED: 3.46e+00, ACC: 0.035:   0%|          | 0/301 [00:01<?, ?it/s]
BLEU: 0.176, ED: 1.26e+00, ACC: 0.119:   0%|          | 0/301 [00:01<?, ?it/s]
BLEU: 0.280, ED: 6.88e-01, ACC: 0.208:   0%|          | 0/301 [00:00<?, ?it/s]
BLEU: 0.370, ED: 6.28e-01, ACC: 0.224:   0%|          | 0/301 [00:00<?, ?it/s]
BLEU: 0.433, ED: 6.02e-01, ACC: 0.267:   0%|          | 0/301 [00:01<?, ?it/s]
BLEU: 0.517, ED: 4.16e-01, ACC: 0.280:   0%|          | 0/301 [00:00<?, ?it/s]
BLEU: 0.552, ED: 4.09e-01, ACC: 0.341:   0%|          | 0/301 [00:01<?, ?it/s]
Loss: 0.2323: 100%|█████████▉| 7752/7767 [14:17<00:01, 9.04it/s]
BLEU: 0.575, ED: 4.12e-01, ACC: 0.335:   3%|▎         | 10/301 [00:20<10:04, 2.08s/it]
Loss: 0.2353:  16%|█▌        | 1232/7767 [02:46<13:28, 8.09it/s]
```
Better split between the dataloader and dataset classes
I've implemented it the way I was talking about. Feel free to comment.
That's weird. Something has to be off here. It should perform very similarly. I don't expect the same curves as before, even with the same seed, but that is not a good sign. Was the grey line at least faster? Can you set the x-axis to wall time for comparison?
So none of them are with the new dataloader setup?
The grey one with ...
The test data in this #154 (comment) is under ... In the first group, the new version is almost 2 times faster than the previous one. In the second group, the gap does not reach 2x, but is about 2 epochs. That's my understanding. I don't have more GPUs to reproduce the phenomenon these days. After this run is finished, I will try the latest commit in this PR again.
I can't follow your calculation.
The purple one is dbf75d9, the grey one is based on 9c6f96e (with only a small modification around the tokenizer and the args/kwargs handling). The grey one's code is the same as the GPU6 new version in #154 (comment). I will verify the latest commit in the PR to reproduce it again, so that we can know whether this is an error in my local code or a real phenomenon.
The calculation is not rigorous and may be wrong; I just want to try to explain the speedup gap. The hypothesis is that num_workers=35 gives up to a 35x speedup of the data loading. When most of a step's time is spent on the CPU, a 35x speedup there makes the overall effect obvious. When most of the time is spent on the GPU and only a small portion on the CPU, a 35x speedup of the CPU part doesn't mean much.
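To make that concrete, here is a rough back-of-the-envelope sketch in the spirit of Amdahl's law; the CPU/GPU time fractions below are made-up assumptions, not measurements from these runs:

```python
# Amdahl's-law style estimate: only the CPU-side data loading is parallelised
# across workers, the GPU part of each step is unchanged. The fractions are
# illustrative assumptions, not values measured in this PR.
def overall_speedup(cpu_fraction, workers):
    gpu_fraction = 1.0 - cpu_fraction
    return 1.0 / (gpu_fraction + cpu_fraction / workers)

for cpu_fraction in (0.9, 0.5, 0.1):
    print(f"CPU share {cpu_fraction:.0%} -> {overall_speedup(cpu_fraction, 35):.1f}x overall")
# CPU share 90% -> 8.0x overall
# CPU share 50% -> 1.9x overall
# CPU share 10% -> 1.1x overall
```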
pix2tex-vit-5.27.11.00
pix2tex-vit-5.25.14.08
pix2tex-vit-5.24.23.09
Viewed against relative (wall-clock) time at 14 hours, pix2tex-vit-5.27.11.00 reached epoch 23 while pix2tex-vit-5.25.14.08 reached epoch 16~17, and the val/bleu curves of the sped-up group trend more closely together, so it's not a coincidence.
Well, that is not very good. I don't know what is going wrong here. It has to do with the new dataloader, right?
Sure, that hash is a misleading message. The wandb run id is 10uj5emo. Furthermore, check the pix2tex version installed in the environment:

```
(base) root@x:/yuhang# conda activate latex_ocr
(latex_ocr) root@x:/yuhang# pip list|grep pix2tex
pix2tex            0.0.26     /yuhang/LaTeX-OCR-TITC
(latex_ocr) root@x:/yuhang# cd ../yuhang/LaTeX-OCR-TITC/
(latex_ocr) root@x:/yuhang/LaTeX-OCR-TITC# git rev-parse HEAD
b90567d07f9c0225fe49b626ddcbf20731050b12
```

The reason why the logged hash is dbf75d9 is that I ran it in another directory, which is your project:

```
(latex_ocr) root@x:/yuhang# cd LaTeX-OCR
(latex_ocr) root@x:/yuhang/LaTeX-OCR# git rev-parse HEAD
dbf75d97f5d256ad3eae130ea6688bf2396df18d
(latex_ocr) root@x:/yuhang/LaTeX-OCR# conda activate pix2tex
(pix2tex) root@x:/yuhang/LaTeX-OCR# pip list|grep pix2tex
pix2tex            0.0.26     /yuhang/LaTeX-OCR
```

I have two copies, one named ...
I have a clue and will keep tracking it. The new version only inherits from a parent class / wraps the dataset in a class; it shouldn't change the number of batches. In the older version, one epoch has 2685 steps (batches) in total:

```
Loss: 0.0162:  52%|█████▏    | 1393/2685 [26:36<17:12, 1.25it/s]
```

but the new version has 2560 steps per epoch:

```
Loss: 0.4345:  63%|██████▎   | 1619/2560 [24:44<11:05, 1.41it/s]
```

I don't think that makes sense. The ...
And the edit distance curve fluctuates more severely at the beginning.
I'm just curious: why not just implement the data-reading steps using the ... If Im2LatexDataset were implemented based on ...
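For context, a minimal sketch of what such a map-style setup could look like; `PairDataset`, the transform, and the padding value are illustrative assumptions, not the actual pix2tex code:

```python
# Hypothetical map-style sketch of the alternative hinted at above.
# Caveat: pix2tex groups images of equal size into a batch, which is harder to do
# with plain automatic batching and is one reason Im2LatexDataset batches itself.
import torch
from torch.utils.data import Dataset, DataLoader

class PairDataset(Dataset):
    def __init__(self, pairs, transform):
        self.pairs = pairs          # list of (image_path, token_ids)
        self.transform = transform  # loads an image path into a tensor

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        path, tokens = self.pairs[idx]
        return self.transform(path), torch.as_tensor(tokens)

def collate(batch):
    # stack images (assumes equal sizes) and pad token sequences to one length
    images, tokens = zip(*batch)
    tokens = torch.nn.utils.rnn.pad_sequence(tokens, batch_first=True, padding_value=0)
    return tokens, torch.stack(images)

# loader = DataLoader(PairDataset(pairs, transform), batch_size=10, shuffle=True,
#                     num_workers=4, collate_fn=collate)
```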
This PR should have been submitted 5 hours ago, but something weird happened.
When I tried to print the image paths to verify that my modification was correct, I found the same path printed 2-4 times in the log file. So I thought there might be something wrong with my code and tried to figure it out.
The reason is that each worker subprocess executes the iter(self) method and then re-shuffles again.
LaTeX-OCR/pix2tex/dataset/dataset.py
Lines 89 to 99 in 9f974c9
One image path gets shuffled multiple times, each time to a different position, and is then sliced into a different subprocess.
I also tried to set the shuffle parameter on the dataloader, but got the error below.
I think the only way to shuffle each epoch is not to shuffle the whole dataset, but to shuffle `self.pairs` inside each subprocess; a rough sketch of that idea follows below.
Feel free to modify any inappropriate part. ^_^
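A rough sketch of the per-worker shuffling idea, not the PR's actual implementation; `PairsIterable` is a hypothetical stand-in for Im2LatexDataset, and the seeding relies on PyTorch's documented per-worker seed = base_seed + worker_id scheme:

```python
# Sketch: build one permutation per epoch that is identical in every worker
# (by recovering the shared base seed), then give each worker an interleaved
# slice, so the epoch still covers every pair exactly once.
# Caveat: with persistent_workers=True the base seed is not refreshed per epoch.
import random
import torch
from torch.utils.data import IterableDataset, get_worker_info

class PairsIterable(IterableDataset):  # hypothetical stand-in for Im2LatexDataset
    def __init__(self, pairs):
        self.pairs = pairs  # e.g. list of (tokenized equation, image path) pairs

    def __iter__(self):
        info = get_worker_info()
        if info is None:                      # single-process data loading
            seed, wid, num = torch.initial_seed(), 0, 1
        else:                                 # recover the per-epoch base seed
            seed, wid, num = info.seed - info.id, info.id, info.num_workers
        order = list(range(len(self.pairs)))
        random.Random(seed).shuffle(order)    # same order in every worker
        for idx in order[wid::num]:           # worker-specific interleaved slice
            yield self.pairs[idx]
```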
Here is the test code.