compatible with torch.utils.data.DataLoader #154
Conversation
Found a solution for the shuffling problem. Not perfect but I can't think of anything better. Test:

```python
import time
import numpy as np
from pix2tex.dataset.dataset import Im2LatexDataset
from pix2tex.utils import *
from tqdm import tqdm
from torch.utils.data import DataLoader
from hashlib import blake2b

dataset = Im2LatexDataset().load("mnt/d/git_repository/LaTeX-OCR/pix2tex/dataset/data/val.pkl")
dataset.update(batchsize=10, test=True, shuffle=True)
dataloader = DataLoader(dataset, batch_size=None, num_workers=4)


def tostr(tokens):
    return post_process(token2str(tokens, dataset.tokenizer)[0])


if __name__ == '__main__':
    start_time = time.time()
    for e in range(0, 2):
        seqhashs = []
        pbar = tqdm(enumerate(iter(dataloader)), total=len(dataloader))
        for i, (seq, im) in pbar:
            seqhashs.extend([blake2b(bytes(tostr(s), encoding='utf-8'),
                                     digest_size=8).hexdigest() for s in seq['input_ids']])
        dataset._shuffle()  # Is necessary now for shuffling. Not perfect.
        print('Only unique samples?', len(seqhashs) == np.unique(seqhashs).shape[0])
        print("=" * 50)
    end_time = time.time()
    # print with 3 decimal places
    print("Time taken: %.3f seconds" % (end_time - start_time))
```
I think that's good enough, maybe write a class to wrap it. One small thing I want to point out in your test code is that ...
Yes, you're right. We can write a custom Dataloader class as well that could handle most things internally.
Thanks for pointing that out. I only wanted to check that I didn't make a mistake while coding and to make sure all samples are being served.
Can't wait to use the new data loader. Do we directly merge it or write a custom ...
I would inherit from the torch DataLoader and overwrite the `__iter__` method.
Like this:

```python
class Dataloader(DataLoader):
    def __init__(self, dataset: Im2LatexDataset, batch_size=1, shuffle=False, *args, **kwargs):
        self.dataset = dataset
        self.dataset.update(batchsize=batch_size, shuffle=shuffle)
        super().__init__(self.dataset, *args, shuffle=False, batch_size=None, **kwargs)

    def __iter__(self):
        self.dataset._shuffle()
        return super().__iter__()
```

That way we can call the data loader similar to the vanilla torch DataLoader:

```python
dataset = Im2LatexDataset().load(path)
dataloader = Dataloader(dataset, batch_size=10, shuffle=True, num_workers=4)
```

Feel free to go from here.
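(As far as I understand the torch API, `batch_size=None` in the `super().__init__` call disables the DataLoader's automatic batching, so every item the workers yield is already a complete batch assembled by Im2LatexDataset; `shuffle` is pinned to False on the parent because shuffling is handled by the dataset itself via `_shuffle()`.)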
Next steps would be to add ... Also check if ...
Got it
Nice work. Only one small comment (the 3 comments are related): I think it would be nicer to separate the dataset and dataloader a little bit, as it is done in PyTorch.
GPU6, new version:

```yaml
gpu_devices: [6] #[0,1,2,3,4,5,6,7]
num_workers: 20
...
```

GPU7, previous version:

```yaml
gpu_devices: [7] #[0,1,2,3,4,5,6,7]
...
```

Same config, training-time comparison.

GPU6 group:

```
BLEU: 0.044, ED: 3.46e+00, ACC: 0.035:   0%|          | 0/301 [00:01<?, ?it/s]
BLEU: 0.176, ED: 1.26e+00, ACC: 0.119:   0%|          | 0/301 [00:01<?, ?it/s]
BLEU: 0.280, ED: 6.88e-01, ACC: 0.208:   0%|          | 0/301 [00:00<?, ?it/s]
BLEU: 0.370, ED: 6.28e-01, ACC: 0.224:   0%|          | 0/301 [00:00<?, ?it/s]
Loss: 0.3471:  60%|█████▉    | 4648/7767 [08:54<04:58, 10.44it/s]
```

GPU7 group:

```
BLEU: 0.017, ED: 7.88e+00, ACC: 0.012:   0%|          | 0/389 [00:08<?, ?it/s]
BLEU: 0.169, ED: 1.08e+00, ACC: 0.059:   0%|          | 0/389 [00:07<?, ?it/s]
Loss: 0.6331:  30%|███       | 2396/7891 [09:34<19:28, 4.70it/s]
```

Thanks for your code review, I will respond after the test is over. I only have the last ...
```python
class Dataloader(DataLoader):
    def __init__(self, dataset: Im2LatexDataset, shuffle=False, *args, **kwargs):
        self.dataset = dataset
        self.tokenizer = dataset.tokenizer
        self.dataset.update(shuffle=shuffle, *args, **kwargs)
        print("Here is the Dataloader class", kwargs.get('pin_memory', False))  # to verify pin_memory works normally
        super().__init__(self.dataset,
                         *args,
                         shuffle=False,
                         batch_size=None,
                         num_workers=kwargs.get('num_workers', 0),
                         pin_memory=kwargs.get('pin_memory', False))
```

Shell output:

```
Here is the Dataloader class True
Here is the Dataloader class True
BLEU: 0.044, ED: 3.46e+00, ACC: 0.035:   0%|          | 0/301 [00:01<?, ?it/s]
BLEU: 0.176, ED: 1.26e+00, ACC: 0.119:   0%|          | 0/301 [00:01<?, ?it/s]
BLEU: 0.280, ED: 6.88e-01, ACC: 0.208:   0%|          | 0/301 [00:00<?, ?it/s]
BLEU: 0.370, ED: 6.28e-01, ACC: 0.224:   0%|          | 0/301 [00:00<?, ?it/s]
BLEU: 0.433, ED: 6.02e-01, ACC: 0.267:   0%|          | 0/301 [00:01<?, ?it/s]
BLEU: 0.517, ED: 4.16e-01, ACC: 0.280:   0%|          | 0/301 [00:00<?, ?it/s]
BLEU: 0.552, ED: 4.09e-01, ACC: 0.341:   0%|          | 0/301 [00:01<?, ?it/s]
Loss: 0.2323: 100%|█████████▉| 7752/7767 [14:17<00:01, 9.04it/s]
BLEU: 0.575, ED: 4.12e-01, ACC: 0.335:   3%|▎         | 10/301 [00:20<10:04, 2.08s/it]
Loss: 0.2353:  16%|█▌        | 1232/7767 [02:46<13:28, 8.09it/s]
```
Better split between the dataloader and dataset classes
I've implemented it the way I was talking about. Feel free to comment.
That's weird. Something has to be off here. It should perform very similarly. I don't expect the same curves as before, even with the same seed, but that is not a good sign. Was the grey line at least faster? Can you set the x-axis to wall time for comparison?
So none of them are with the new dataloader setup?
The grey one with ...
The test data in this #154 (comment) is under ... In the first group, the new version is almost 2 times faster than the previous one. In the second group, the gap does not reach 2x, but is about 2 epochs. That's my understanding. I don't have more GPUs to reproduce the phenomenon these days. After this run is finished, I will try the latest commit in this PR again.
I can't follow your calculation.
The purple one is dbf75d9, the grey one is based on 9c6f96e (with only a small modification around the tokenizer and the args/kwargs handling). The grey one's code is the same as the GPU6 new version in #154 (comment). I will verify the latest commit in the PR to reproduce it again, so that we can know whether this is an error in my local code or a real phenomenon.
The calculation is not rigorous and may be wrong; I just want to try to explain the speedup gap. The hypothesis is that num_workers=35 gives up to a 35x speedup of the data loading. When most of a step's time is spent on the CPU, a 35x speedup there makes the overall effect obvious. When most of the time is spent on the GPU and only a small portion on the CPU, a 35x speedup of the CPU part doesn't mean much.
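To make that concrete, here is a rough back-of-the-envelope sketch in the spirit of Amdahl's law; the CPU/GPU time fractions below are made-up assumptions, not measurements from these runs:

```python
# Amdahl's-law style estimate: only the CPU-side data loading is parallelised
# across workers, the GPU part of each step is unchanged. The fractions are
# illustrative assumptions, not values measured in this PR.
def overall_speedup(cpu_fraction, workers):
    gpu_fraction = 1.0 - cpu_fraction
    return 1.0 / (gpu_fraction + cpu_fraction / workers)

for cpu_fraction in (0.9, 0.5, 0.1):
    print(f"CPU share {cpu_fraction:.0%} -> {overall_speedup(cpu_fraction, 35):.1f}x overall")
# CPU share 90% -> 8.0x overall
# CPU share 50% -> 1.9x overall
# CPU share 10% -> 1.1x overall
```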
pix2tex-vit-5.27.11.00
pix2tex-vit-5.25.14.08
pix2tex-vit-5.24.23.09
Viewed against relative (wall-clock) time at 14 hours, pix2tex-vit-5.27.11.00 reached epoch 23 while pix2tex-vit-5.25.14.08 reached epoch 16~17, and the val/bleu curves of the sped-up group trend more closely together, so it's not a coincidence.
Well, that is not very good. I don't know what is going wrong here. It has to do with the new dataloader, right?
Sure, that hash is a misleading message. The wandb run id is 10uj5emo. Furthermore, check the pix2tex version installed in the environment:

```
(base) root@x:/yuhang# conda activate latex_ocr
(latex_ocr) root@x:/yuhang# pip list|grep pix2tex
pix2tex            0.0.26     /yuhang/LaTeX-OCR-TITC
(latex_ocr) root@x:/yuhang# cd ../yuhang/LaTeX-OCR-TITC/
(latex_ocr) root@x:/yuhang/LaTeX-OCR-TITC# git rev-parse HEAD
b90567d07f9c0225fe49b626ddcbf20731050b12
```

The reason why the logged hash is dbf75d9 is that I ran it in another directory, which is your project:

```
(latex_ocr) root@x:/yuhang# cd LaTeX-OCR
(latex_ocr) root@x:/yuhang/LaTeX-OCR# git rev-parse HEAD
dbf75d97f5d256ad3eae130ea6688bf2396df18d
(latex_ocr) root@x:/yuhang/LaTeX-OCR# conda activate pix2tex
(pix2tex) root@x:/yuhang/LaTeX-OCR# pip list|grep pix2tex
pix2tex            0.0.26     /yuhang/LaTeX-OCR
```

I have two copies, one named ...
I have a clue and will keep tracking it. The new version only inherits from a parent class / wraps the dataset in a class; it shouldn't change the number of batches. In the older version, one epoch has 2685 steps (batches) in total:

```
Loss: 0.0162:  52%|█████▏    | 1393/2685 [26:36<17:12, 1.25it/s]
```

but the new version has 2560 steps per epoch:

```
Loss: 0.4345:  63%|██████▎   | 1619/2560 [24:44<11:05, 1.41it/s]
```

I don't think that makes sense. The ...
And the edit distance curve fluctuates more severely at the beginning.
I'm just curious: why not just implement the data-reading steps using the ... If Im2LatexDataset were implemented based on ...
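For context, a minimal sketch of what such a map-style setup could look like; `PairDataset`, the transform, and the padding value are illustrative assumptions, not the actual pix2tex code:

```python
# Hypothetical map-style sketch of the alternative hinted at above.
# Caveat: pix2tex groups images of equal size into a batch, which is harder to do
# with plain automatic batching and is one reason Im2LatexDataset batches itself.
import torch
from torch.utils.data import Dataset, DataLoader

class PairDataset(Dataset):
    def __init__(self, pairs, transform):
        self.pairs = pairs          # list of (image_path, token_ids)
        self.transform = transform  # loads an image path into a tensor

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        path, tokens = self.pairs[idx]
        return self.transform(path), torch.as_tensor(tokens)

def collate(batch):
    # stack images (assumes equal sizes) and pad token sequences to one length
    images, tokens = zip(*batch)
    tokens = torch.nn.utils.rnn.pad_sequence(tokens, batch_first=True, padding_value=0)
    return tokens, torch.stack(images)

# loader = DataLoader(PairDataset(pairs, transform), batch_size=10, shuffle=True,
#                     num_workers=4, collate_fn=collate)
```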
This PR should have been submitted 5 hours ago, but something weird happened.
When I tried to print the image paths to verify that my modification was correct, I found the same path printed 2-4 times in the log file. So I thought there might be something wrong with my code and tried to figure it out.
The reason is that each worker subprocess executes the iter(self) method and then re-shuffles again.
LaTeX-OCR/pix2tex/dataset/dataset.py
Lines 89 to 99 in 9f974c9
One image path gets shuffled multiple times, each time to a different position, and is then sliced into a different subprocess.
I also tried to set the shuffle parameter on the dataloader, but got the error below.
I think the only way to shuffle each epoch is not to shuffle the whole dataset, but to shuffle `self.pairs` inside each subprocess; a rough sketch of that idea follows below.
Feel free to modify any inappropriate part. ^_^
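A rough sketch of the per-worker shuffling idea, not the PR's actual implementation; `PairsIterable` is a hypothetical stand-in for Im2LatexDataset, and the seeding relies on PyTorch's documented per-worker seed = base_seed + worker_id scheme:

```python
# Sketch: build one permutation per epoch that is identical in every worker
# (by recovering the shared base seed), then give each worker an interleaved
# slice, so the epoch still covers every pair exactly once.
# Caveat: with persistent_workers=True the base seed is not refreshed per epoch.
import random
import torch
from torch.utils.data import IterableDataset, get_worker_info

class PairsIterable(IterableDataset):  # hypothetical stand-in for Im2LatexDataset
    def __init__(self, pairs):
        self.pairs = pairs  # e.g. list of (tokenized equation, image path) pairs

    def __iter__(self):
        info = get_worker_info()
        if info is None:                      # single-process data loading
            seed, wid, num = torch.initial_seed(), 0, 1
        else:                                 # recover the per-epoch base seed
            seed, wid, num = info.seed - info.id, info.id, info.num_workers
        order = list(range(len(self.pairs)))
        random.Random(seed).shuffle(order)    # same order in every worker
        for idx in order[wid::num]:           # worker-specific interleaved slice
            yield self.pairs[idx]
```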
Here is the test code.