Segmentation Performance Evaluation Result is significantly different than paper gIoU:60.0 cIoU:67.8 vs gIoU:0.0759 cIoU:0.0648 #103

Open
ssablak opened this issue Jan 10, 2024 · 3 comments


ssablak commented Jan 10, 2024

Hi @X-Lai, @tianzhuotao, @yukang2017, @yanwei-li, @xbkaishui
I am evaluating your method with your provided model (xinlai/LISA-13B-llama2-v1) as part of my research. I see a significant difference between the numbers reported in your paper and the results I get when evaluating xinlai/LISA-13B-llama2-v1 with your --eval_only flag.

[Screenshot from 2024-01-09 18-51-10]
Your paper reports:
LISA-Llama2-13B (ft) gIoU: 60.0 cIoU: 67.8

When I run your source code with --eval_only, I get:
LISA-Llama2-13B (ft) gIoU: 0.0759 cIoU: 0.0648
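
For context, here is a minimal sketch of how the two metrics are commonly computed. It assumes that gIoU is the mean of per-image IoUs and that cIoU is the cumulative intersection over the cumulative union. The function name `giou_ciou`, its inputs, and the empty-mask convention are hypothetical, and this is not the repository's evaluation code.

```python
import numpy as np


def giou_ciou(preds, targets, eps=1e-10):
    """Hypothetical helper: preds/targets are sequences of binary masks.

    Returns (gIoU, cIoU) as fractions in [0, 1], not percentages.
    """
    total_inter = 0.0
    total_union = 0.0
    per_image_iou = []
    for pred, target in zip(preds, targets):
        pred = np.asarray(pred, dtype=bool)
        target = np.asarray(target, dtype=bool)
        inter = np.logical_and(pred, target).sum()
        union = np.logical_or(pred, target).sum()
        # Assumed convention: empty prediction on an empty target counts as IoU 1.
        per_image_iou.append(1.0 if union == 0 else inter / union)
        total_inter += inter
        total_union += union
    giou = float(np.mean(per_image_iou))            # mean of per-image IoUs
    ciou = float(total_inter / (total_union + eps))  # cumulative inter / cumulative union
    return giou, ciou
```

A function like this returns fractions, whereas the paper's table appears to use percentages. If the script prints fractions, 0.0759 would correspond to 7.59 on the paper's scale, which is still far below 60.0.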

Could you please help me figure out what I am missing, or tell me the best way to evaluate your released model?
Thanks,
-Steve

(test) ssablak@ssablak-5820u:/media/$ python train_ds.py --version="/media/models/LISA-13B-llama2-v1" --dataset_dir="/media/dataset" --vision_pretrained="/media/models/SAM/sam_vit_h_4b8939.pth" --exp_name="LISA-13B-llama2-v1" --eval_only

[2024-01-09 18:49:34,938] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at huggingface/transformers#24565
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:44<00:00, 14.91s/it]
trainable params: 6,553,600 || all params: 13,999,497,520 || trainable%: 0.04681310876077786
n: base_model.model.model.embed_tokens.weight p.shape: torch.Size([32003, 5120])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.0.self_attn.q_proj.weight p.shape: torch.Size([256, 256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.0.self_attn.q_proj.bias p.shape: torch.Size([256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.0.self_attn.k_proj.weight p.shape: torch.Size([256, 256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.0.self_attn.k_proj.bias p.shape: torch.Size([256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.0.self_attn.v_proj.weight p.shape: torch.Size([256, 256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.0.self_attn.v_proj.bias p.shape: torch.Size([256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.0.self_attn.out_proj.weight p.shape: torch.Size([256, 256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.0.self_attn.out_proj.bias p.shape: torch.Size([256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.0.norm1.weight p.shape: torch.Size([256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.0.norm1.bias p.shape: torch.Size([256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.0.cross_attn_token_to_image.q_proj.weight p.shape: torch.Size([128, 256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.0.cross_attn_token_to_image.q_proj.bias p.shape: torch.Size([128])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.0.cross_attn_token_to_image.k_proj.weight p.shape: torch.Size([128, 256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.0.cross_attn_token_to_image.k_proj.bias p.shape: torch.Size([128])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.0.cross_attn_token_to_image.v_proj.weight p.shape: torch.Size([128, 256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.0.cross_attn_token_to_image.v_proj.bias p.shape: torch.Size([128])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.0.cross_attn_token_to_image.out_proj.weight p.shape: torch.Size([256, 128])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.0.cross_attn_token_to_image.out_proj.bias p.shape: torch.Size([256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.0.norm2.weight p.shape: torch.Size([256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.0.norm2.bias p.shape: torch.Size([256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.0.mlp.lin1.weight p.shape: torch.Size([2048, 256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.0.mlp.lin1.bias p.shape: torch.Size([2048])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.0.mlp.lin2.weight p.shape: torch.Size([256, 2048])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.0.mlp.lin2.bias p.shape: torch.Size([256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.0.norm3.weight p.shape: torch.Size([256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.0.norm3.bias p.shape: torch.Size([256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.0.norm4.weight p.shape: torch.Size([256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.0.norm4.bias p.shape: torch.Size([256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.0.cross_attn_image_to_token.q_proj.weight p.shape: torch.Size([128, 256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.0.cross_attn_image_to_token.q_proj.bias p.shape: torch.Size([128])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.0.cross_attn_image_to_token.k_proj.weight p.shape: torch.Size([128, 256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.0.cross_attn_image_to_token.k_proj.bias p.shape: torch.Size([128])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.0.cross_attn_image_to_token.v_proj.weight p.shape: torch.Size([128, 256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.0.cross_attn_image_to_token.v_proj.bias p.shape: torch.Size([128])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.0.cross_attn_image_to_token.out_proj.weight p.shape: torch.Size([256, 128])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.0.cross_attn_image_to_token.out_proj.bias p.shape: torch.Size([256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.1.self_attn.q_proj.weight p.shape: torch.Size([256, 256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.1.self_attn.q_proj.bias p.shape: torch.Size([256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.1.self_attn.k_proj.weight p.shape: torch.Size([256, 256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.1.self_attn.k_proj.bias p.shape: torch.Size([256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.1.self_attn.v_proj.weight p.shape: torch.Size([256, 256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.1.self_attn.v_proj.bias p.shape: torch.Size([256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.1.self_attn.out_proj.weight p.shape: torch.Size([256, 256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.1.self_attn.out_proj.bias p.shape: torch.Size([256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.1.norm1.weight p.shape: torch.Size([256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.1.norm1.bias p.shape: torch.Size([256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.1.cross_attn_token_to_image.q_proj.weight p.shape: torch.Size([128, 256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.1.cross_attn_token_to_image.q_proj.bias p.shape: torch.Size([128])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.1.cross_attn_token_to_image.k_proj.weight p.shape: torch.Size([128, 256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.1.cross_attn_token_to_image.k_proj.bias p.shape: torch.Size([128])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.1.cross_attn_token_to_image.v_proj.weight p.shape: torch.Size([128, 256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.1.cross_attn_token_to_image.v_proj.bias p.shape: torch.Size([128])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.1.cross_attn_token_to_image.out_proj.weight p.shape: torch.Size([256, 128])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.1.cross_attn_token_to_image.out_proj.bias p.shape: torch.Size([256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.1.norm2.weight p.shape: torch.Size([256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.1.norm2.bias p.shape: torch.Size([256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.1.mlp.lin1.weight p.shape: torch.Size([2048, 256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.1.mlp.lin1.bias p.shape: torch.Size([2048])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.1.mlp.lin2.weight p.shape: torch.Size([256, 2048])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.1.mlp.lin2.bias p.shape: torch.Size([256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.1.norm3.weight p.shape: torch.Size([256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.1.norm3.bias p.shape: torch.Size([256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.1.norm4.weight p.shape: torch.Size([256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.1.norm4.bias p.shape: torch.Size([256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.1.cross_attn_image_to_token.q_proj.weight p.shape: torch.Size([128, 256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.1.cross_attn_image_to_token.q_proj.bias p.shape: torch.Size([128])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.1.cross_attn_image_to_token.k_proj.weight p.shape: torch.Size([128, 256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.1.cross_attn_image_to_token.k_proj.bias p.shape: torch.Size([128])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.1.cross_attn_image_to_token.v_proj.weight p.shape: torch.Size([128, 256])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.1.cross_attn_image_to_token.v_proj.bias p.shape: torch.Size([128])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.1.cross_attn_image_to_token.out_proj.weight p.shape: torch.Size([256, 128])
n: base_model.model.model.visual_model.mask_decoder.transformer.layers.1.cross_attn_image_to_token.out_proj.bias p.shape: torch.Size([256])
n: base_model.model.model.visual_model.mask_decoder.transformer.final_attn_token_to_image.q_proj.weight p.shape: torch.Size([128, 256])
n: base_model.model.model.visual_model.mask_decoder.transformer.final_attn_token_to_image.q_proj.bias p.shape: torch.Size([128])
n: base_model.model.model.visual_model.mask_decoder.transformer.final_attn_token_to_image.k_proj.weight p.shape: torch.Size([128, 256])
n: base_model.model.model.visual_model.mask_decoder.transformer.final_attn_token_to_image.k_proj.bias p.shape: torch.Size([128])
n: base_model.model.model.visual_model.mask_decoder.transformer.final_attn_token_to_image.v_proj.weight p.shape: torch.Size([128, 256])
n: base_model.model.model.visual_model.mask_decoder.transformer.final_attn_token_to_image.v_proj.bias p.shape: torch.Size([128])
n: base_model.model.model.visual_model.mask_decoder.transformer.final_attn_token_to_image.out_proj.weight p.shape: torch.Size([256, 128])
n: base_model.model.model.visual_model.mask_decoder.transformer.final_attn_token_to_image.out_proj.bias p.shape: torch.Size([256])
n: base_model.model.model.visual_model.mask_decoder.transformer.norm_final_attn.weight p.shape: torch.Size([256])
n: base_model.model.model.visual_model.mask_decoder.transformer.norm_final_attn.bias p.shape: torch.Size([256])
n: base_model.model.model.visual_model.mask_decoder.iou_token.weight p.shape: torch.Size([1, 256])
n: base_model.model.model.visual_model.mask_decoder.mask_tokens.weight p.shape: torch.Size([4, 256])
n: base_model.model.model.visual_model.mask_decoder.output_upscaling.0.weight p.shape: torch.Size([256, 64, 2, 2])
n: base_model.model.model.visual_model.mask_decoder.output_upscaling.0.bias p.shape: torch.Size([64])
n: base_model.model.model.visual_model.mask_decoder.output_upscaling.1.weight p.shape: torch.Size([64])
n: base_model.model.model.visual_model.mask_decoder.output_upscaling.1.bias p.shape: torch.Size([64])
n: base_model.model.model.visual_model.mask_decoder.output_upscaling.3.weight p.shape: torch.Size([64, 32, 2, 2])
n: base_model.model.model.visual_model.mask_decoder.output_upscaling.3.bias p.shape: torch.Size([32])
n: base_model.model.model.visual_model.mask_decoder.output_hypernetworks_mlps.0.layers.0.weight p.shape: torch.Size([256, 256])
n: base_model.model.model.visual_model.mask_decoder.output_hypernetworks_mlps.0.layers.0.bias p.shape: torch.Size([256])
n: base_model.model.model.visual_model.mask_decoder.output_hypernetworks_mlps.0.layers.1.weight p.shape: torch.Size([256, 256])
n: base_model.model.model.visual_model.mask_decoder.output_hypernetworks_mlps.0.layers.1.bias p.shape: torch.Size([256])
n: base_model.model.model.visual_model.mask_decoder.output_hypernetworks_mlps.0.layers.2.weight p.shape: torch.Size([32, 256])
n: base_model.model.model.visual_model.mask_decoder.output_hypernetworks_mlps.0.layers.2.bias p.shape: torch.Size([32])
n: base_model.model.model.visual_model.mask_decoder.output_hypernetworks_mlps.1.layers.0.weight p.shape: torch.Size([256, 256])
n: base_model.model.model.visual_model.mask_decoder.output_hypernetworks_mlps.1.layers.0.bias p.shape: torch.Size([256])
n: base_model.model.model.visual_model.mask_decoder.output_hypernetworks_mlps.1.layers.1.weight p.shape: torch.Size([256, 256])
n: base_model.model.model.visual_model.mask_decoder.output_hypernetworks_mlps.1.layers.1.bias p.shape: torch.Size([256])
n: base_model.model.model.visual_model.mask_decoder.output_hypernetworks_mlps.1.layers.2.weight p.shape: torch.Size([32, 256])
n: base_model.model.model.visual_model.mask_decoder.output_hypernetworks_mlps.1.layers.2.bias p.shape: torch.Size([32])
n: base_model.model.model.visual_model.mask_decoder.output_hypernetworks_mlps.2.layers.0.weight p.shape: torch.Size([256, 256])
n: base_model.model.model.visual_model.mask_decoder.output_hypernetworks_mlps.2.layers.0.bias p.shape: torch.Size([256])
n: base_model.model.model.visual_model.mask_decoder.output_hypernetworks_mlps.2.layers.1.weight p.shape: torch.Size([256, 256])
n: base_model.model.model.visual_model.mask_decoder.output_hypernetworks_mlps.2.layers.1.bias p.shape: torch.Size([256])
n: base_model.model.model.visual_model.mask_decoder.output_hypernetworks_mlps.2.layers.2.weight p.shape: torch.Size([32, 256])
n: base_model.model.model.visual_model.mask_decoder.output_hypernetworks_mlps.2.layers.2.bias p.shape: torch.Size([32])
n: base_model.model.model.visual_model.mask_decoder.output_hypernetworks_mlps.3.layers.0.weight p.shape: torch.Size([256, 256])
n: base_model.model.model.visual_model.mask_decoder.output_hypernetworks_mlps.3.layers.0.bias p.shape: torch.Size([256])
n: base_model.model.model.visual_model.mask_decoder.output_hypernetworks_mlps.3.layers.1.weight p.shape: torch.Size([256, 256])
n: base_model.model.model.visual_model.mask_decoder.output_hypernetworks_mlps.3.layers.1.bias p.shape: torch.Size([256])
n: base_model.model.model.visual_model.mask_decoder.output_hypernetworks_mlps.3.layers.2.weight p.shape: torch.Size([32, 256])
n: base_model.model.model.visual_model.mask_decoder.output_hypernetworks_mlps.3.layers.2.bias p.shape: torch.Size([32])
n: base_model.model.model.visual_model.mask_decoder.iou_prediction_head.layers.0.weight p.shape: torch.Size([256, 256])
n: base_model.model.model.visual_model.mask_decoder.iou_prediction_head.layers.0.bias p.shape: torch.Size([256])
n: base_model.model.model.visual_model.mask_decoder.iou_prediction_head.layers.1.weight p.shape: torch.Size([256, 256])
n: base_model.model.model.visual_model.mask_decoder.iou_prediction_head.layers.1.bias p.shape: torch.Size([256])
n: base_model.model.model.visual_model.mask_decoder.iou_prediction_head.layers.2.weight p.shape: torch.Size([4, 256])
n: base_model.model.model.visual_model.mask_decoder.iou_prediction_head.layers.2.bias p.shape: torch.Size([4])
n: base_model.model.model.text_hidden_fcs.0.0.weight p.shape: torch.Size([5120, 5120])
n: base_model.model.model.text_hidden_fcs.0.0.bias p.shape: torch.Size([5120])
n: base_model.model.model.text_hidden_fcs.0.2.weight p.shape: torch.Size([256, 5120])
n: base_model.model.model.text_hidden_fcs.0.2.bias p.shape: torch.Size([256])
n: base_model.model.lm_head.weight p.shape: torch.Size([32003, 5120])
ade20k: 20210
cocostuff: 118287
loading annotations into memory...
Done (t=0.70s)
creating index...
index created!
pascal_part: 4366
loading annotations into memory...
Done (t=9.70s)
creating index...
index created!
paco_lvis: 45790
mapillary: 18000
loading dataset refclef into memory...
ref_file: /media/ssablak/mogata/dataset/refer_seg/refclef/refs(unc).p
creating index...
index created.
DONE (t=3.71s)
dataset refclef (refs unc) (train split) has 17978 images and 99523 annotations.
loading dataset refcoco into memory...
ref_file: /media/ssablak/mogata/dataset/refer_seg/refcoco/refs(unc).p
creating index...
index created.
DONE (t=6.99s)
dataset refcoco (refs unc) (train split) has 16994 images and 196771 annotations.
loading dataset refcoco+ into memory...
ref_file: /media/ssablak/mogata/dataset/refer_seg/refcoco+/refs(unc).p
creating index...
index created.
DONE (t=5.24s)
dataset refcoco+ (refs unc) (train split) has 16992 images and 196737 annotations.
loading dataset refcocog into memory...
ref_file: /media/ssablak/mogata/dataset/refer_seg/refcocog/refs(umd).p
creating index...
index created.
DONE (t=7.61s)
dataset refcocog (refs umd) (train split) has 21899 images and 208960 annotations.
vqa_data: 157712
number of reason_seg samples: 239
len(self.img_to_explanation): 239
Training with 10000 examples and validating with 200 examples.
[2024-01-09 18:51:34,030] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.12.6, git-hash=unknown, git-branch=unknown
[2024-01-09 18:51:34,031] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-01-09 18:51:34,031] [INFO] [comm.py:652:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
[2024-01-09 18:51:35,835] [INFO] [comm.py:702:mpi_discovery] Discovered MPI settings of world_rank=0, local_rank=0, world_size=1, master_addr=192.168.5.132, master_port=29500
[2024-01-09 18:51:35,836] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[W socket.cpp:426] [c10d] The server socket cannot be initialized on [::]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [ssablak-5820u]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [ssablak-5820u]:29500 (errno: 97 - Address family not supported by protocol).
[2024-01-09 18:51:41,974] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /home/ssablak/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/ssablak/.cache/torch_extensions/py39_cu117/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 1.52187180519104 seconds
[2024-01-09 18:51:44,440] [INFO] [logging.py:96:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2024-01-09 18:51:44,440] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2024-01-09 18:51:44,753] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
[2024-01-09 18:51:44,753] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'>
[2024-01-09 18:51:44,753] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 2 optimizer
[2024-01-09 18:51:44,753] [INFO] [stage_1_and_2.py:148:init] Reduce bucket size 500000000
[2024-01-09 18:51:44,753] [INFO] [stage_1_and_2.py:149:init] Allgather bucket size 500000000
[2024-01-09 18:51:44,753] [INFO] [stage_1_and_2.py:150:init] CPU Offload: False
[2024-01-09 18:51:44,753] [INFO] [stage_1_and_2.py:151:init] Round robin gradient partitioning: False
[2024-01-09 18:51:48,972] [INFO] [utils.py:791:see_memory_usage] Before initializing optimizer states
[2024-01-09 18:51:48,973] [INFO] [utils.py:792:see_memory_usage] MA 27.64 GB Max_MA 28.32 GB CA 28.46 GB Max_CA 28 GB
[2024-01-09 18:51:48,973] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 21.61 GB, percent = 17.2%
[2024-01-09 18:51:52,356] [INFO] [utils.py:791:see_memory_usage] After initializing optimizer states
[2024-01-09 18:51:52,357] [INFO] [utils.py:792:see_memory_usage] MA 30.36 GB Max_MA 31.73 GB CA 32.55 GB Max_CA 33 GB
[2024-01-09 18:51:52,357] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 21.61 GB, percent = 17.2%
[2024-01-09 18:51:52,357] [INFO] [stage_1_and_2.py:516:init] optimizer state initialized
[2024-01-09 18:51:55,766] [INFO] [utils.py:791:see_memory_usage] After initializing ZeRO optimizer
[2024-01-09 18:51:55,767] [INFO] [utils.py:792:see_memory_usage] MA 30.36 GB Max_MA 30.36 GB CA 32.55 GB Max_CA 33 GB
[2024-01-09 18:51:55,767] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 21.61 GB, percent = 17.2%
[2024-01-09 18:51:55,775] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = adamw
[2024-01-09 18:51:55,775] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using configured LR scheduler = WarmupDecayLR
[2024-01-09 18:51:55,775] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <deepspeed.runtime.lr_schedules.WarmupDecayLR object at 0x7fcf729ff460>
[2024-01-09 18:51:55,775] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0003], mom=[(0.9, 0.95)]
[2024-01-09 18:51:55,778] [INFO] [config.py:984:print] DeepSpeedEngine configuration:
[2024-01-09 18:51:55,779] [INFO] [config.py:988:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2024-01-09 18:51:55,779] [INFO] [config.py:988:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2024-01-09 18:51:55,779] [INFO] [config.py:988:print] amp_enabled .................. False
[2024-01-09 18:51:55,779] [INFO] [config.py:988:print] amp_params ................... False
[2024-01-09 18:51:55,779] [INFO] [config.py:988:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": "autotuning_results",
"exps_dir": "autotuning_exps",
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2024-01-09 18:51:55,779] [INFO] [config.py:988:print] bfloat16_enabled ............. False
[2024-01-09 18:51:55,779] [INFO] [config.py:988:print] checkpoint_parallel_write_pipeline False
[2024-01-09 18:51:55,779] [INFO] [config.py:988:print] checkpoint_tag_validation_enabled True
[2024-01-09 18:51:55,779] [INFO] [config.py:988:print] checkpoint_tag_validation_fail False
[2024-01-09 18:51:55,779] [INFO] [config.py:988:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7fcf729e5fa0>
[2024-01-09 18:51:55,779] [INFO] [config.py:988:print] communication_data_type ...... None
[2024-01-09 18:51:55,779] [INFO] [config.py:988:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2024-01-09 18:51:55,779] [INFO] [config.py:988:print] curriculum_enabled_legacy .... False
[2024-01-09 18:51:55,779] [INFO] [config.py:988:print] curriculum_params_legacy ..... False
[2024-01-09 18:51:55,779] [INFO] [config.py:988:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2024-01-09 18:51:55,779] [INFO] [config.py:988:print] data_efficiency_enabled ...... False
[2024-01-09 18:51:55,779] [INFO] [config.py:988:print] dataloader_drop_last ......... False
[2024-01-09 18:51:55,779] [INFO] [config.py:988:print] disable_allgather ............ False
[2024-01-09 18:51:55,779] [INFO] [config.py:988:print] dump_state ................... False
[2024-01-09 18:51:55,779] [INFO] [config.py:988:print] dynamic_loss_scale_args ...... None
[2024-01-09 18:51:55,779] [INFO] [config.py:988:print] eigenvalue_enabled ........... False
[2024-01-09 18:51:55,779] [INFO] [config.py:988:print] eigenvalue_gas_boundary_resolution 1
[2024-01-09 18:51:55,779] [INFO] [config.py:988:print] eigenvalue_layer_name ........ bert.encoder.layer
[2024-01-09 18:51:55,779] [INFO] [config.py:988:print] eigenvalue_layer_num ......... 0
[2024-01-09 18:51:55,779] [INFO] [config.py:988:print] eigenvalue_max_iter .......... 100
[2024-01-09 18:51:55,779] [INFO] [config.py:988:print] eigenvalue_stability ......... 1e-06
[2024-01-09 18:51:55,779] [INFO] [config.py:988:print] eigenvalue_tol ............... 0.01
[2024-01-09 18:51:55,779] [INFO] [config.py:988:print] eigenvalue_verbose ........... False
[2024-01-09 18:51:55,779] [INFO] [config.py:988:print] elasticity_enabled ........... False
[2024-01-09 18:51:55,779] [INFO] [config.py:988:print] flops_profiler_config ........ {
"enabled": false,
"recompute_fwd_factor": 0.0,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2024-01-09 18:51:55,779] [INFO] [config.py:988:print] fp16_auto_cast ............... False
[2024-01-09 18:51:55,779] [INFO] [config.py:988:print] fp16_enabled ................. True
[2024-01-09 18:51:55,779] [INFO] [config.py:988:print] fp16_master_weights_and_gradients False
[2024-01-09 18:51:55,780] [INFO] [config.py:988:print] global_rank .................. 0
[2024-01-09 18:51:55,780] [INFO] [config.py:988:print] grad_accum_dtype ............. None
[2024-01-09 18:51:55,780] [INFO] [config.py:988:print] gradient_accumulation_steps .. 10
[2024-01-09 18:51:55,780] [INFO] [config.py:988:print] gradient_clipping ............ 1.0
[2024-01-09 18:51:55,780] [INFO] [config.py:988:print] gradient_predivide_factor .... 1.0
[2024-01-09 18:51:55,780] [INFO] [config.py:988:print] graph_harvesting ............. False
[2024-01-09 18:51:55,780] [INFO] [config.py:988:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2024-01-09 18:51:55,780] [INFO] [config.py:988:print] initial_dynamic_scale ........ 65536
[2024-01-09 18:51:55,780] [INFO] [config.py:988:print] load_universal_checkpoint .... False
[2024-01-09 18:51:55,780] [INFO] [config.py:988:print] loss_scale ................... 0
[2024-01-09 18:51:55,780] [INFO] [config.py:988:print] memory_breakdown ............. False
[2024-01-09 18:51:55,780] [INFO] [config.py:988:print] mics_hierarchial_params_gather False
[2024-01-09 18:51:55,780] [INFO] [config.py:988:print] mics_shard_size .............. -1
[2024-01-09 18:51:55,780] [INFO] [config.py:988:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2024-01-09 18:51:55,780] [INFO] [config.py:988:print] nebula_config ................ {
"enabled": false,
"persistent_storage_path": null,
"persistent_time_interval": 100,
"num_of_version_in_retention": 2,
"enable_nebula_load": true,
"load_path": null
}
[2024-01-09 18:51:55,780] [INFO] [config.py:988:print] optimizer_legacy_fusion ...... False
[2024-01-09 18:51:55,780] [INFO] [config.py:988:print] optimizer_name ............... adamw
[2024-01-09 18:51:55,780] [INFO] [config.py:988:print] optimizer_params ............. {'lr': 0.0003, 'weight_decay': 0.0, 'betas': (0.9, 0.95)}
[2024-01-09 18:51:55,780] [INFO] [config.py:988:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2024-01-09 18:51:55,780] [INFO] [config.py:988:print] pld_enabled .................. False
[2024-01-09 18:51:55,780] [INFO] [config.py:988:print] pld_params ................... False
[2024-01-09 18:51:55,780] [INFO] [config.py:988:print] prescale_gradients ........... False
[2024-01-09 18:51:55,780] [INFO] [config.py:988:print] scheduler_name ............... WarmupDecayLR
[2024-01-09 18:51:55,780] [INFO] [config.py:988:print] scheduler_params ............. {'total_num_steps': 5000, 'warmup_min_lr': 0, 'warmup_max_lr': 0.0003, 'warmup_num_steps': 100, 'warmup_type': 'linear'}
[2024-01-09 18:51:55,780] [INFO] [config.py:988:print] seq_parallel_communication_data_type torch.float32
[2024-01-09 18:51:55,780] [INFO] [config.py:988:print] sparse_attention ............. None
[2024-01-09 18:51:55,780] [INFO] [config.py:988:print] sparse_gradients_enabled ..... False
[2024-01-09 18:51:55,780] [INFO] [config.py:988:print] steps_per_print .............. 10
[2024-01-09 18:51:55,780] [INFO] [config.py:988:print] train_batch_size ............. 20
[2024-01-09 18:51:55,780] [INFO] [config.py:988:print] train_micro_batch_size_per_gpu 2
[2024-01-09 18:51:55,780] [INFO] [config.py:988:print] use_data_before_expert_parallel_ False
[2024-01-09 18:51:55,780] [INFO] [config.py:988:print] use_node_local_storage ....... False
[2024-01-09 18:51:55,780] [INFO] [config.py:988:print] wall_clock_breakdown ......... False
[2024-01-09 18:51:55,780] [INFO] [config.py:988:print] weight_quantization_config ... None
[2024-01-09 18:51:55,780] [INFO] [config.py:988:print] world_size ................... 1
[2024-01-09 18:51:55,780] [INFO] [config.py:988:print] zero_allow_untested_optimizer False
[2024-01-09 18:51:55,780] [INFO] [config.py:988:print] zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2024-01-09 18:51:55,780] [INFO] [config.py:988:print] zero_enabled ................. True
[2024-01-09 18:51:55,780] [INFO] [config.py:988:print] zero_force_ds_cpu_optimizer .. True
[2024-01-09 18:51:55,780] [INFO] [config.py:988:print] zero_optimization_stage ...... 2
[2024-01-09 18:51:55,781] [INFO] [config.py:974:print_user_config] json = {
"train_micro_batch_size_per_gpu": 2,
"gradient_accumulation_steps": 10,
"optimizer": {
"type": "AdamW",
"params": {
"lr": 0.0003,
"weight_decay": 0.0,
"betas": [0.9, 0.95]
}
},
"scheduler": {
"type": "WarmupDecayLR",
"params": {
"total_num_steps": 5.000000e+03,
"warmup_min_lr": 0,
"warmup_max_lr": 0.0003,
"warmup_num_steps": 100,
"warmup_type": "linear"
}
},
"fp16": {
"enabled": true
},
"bf16": {
"enabled": false
},
"gradient_clipping": 1.0,
"zero_optimization": {
"stage": 2,
"contiguous_gradients": true,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 5.000000e+08,
"allgather_bucket_size": 5.000000e+08
}
}
<torch.utils.data.dataloader.DataLoader object at 0x7fd109aabca0>
0%| | 0/200 [00:00<?, ?it/s]Token indices sequence length is longer than the specified maximum sequence length for this model (757 > 512). Running this sequence through the model will result in indexing errors
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [01:18<00:00, 2.56it/s]
giou: 0.0759, ciou: 0.0648
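
For reference when comparing against the paper, here is a minimal sketch of the two metrics as the paper describes them: gIoU is the average of the per-image IoUs, and cIoU is the cumulative intersection over the cumulative union. The function name `iou_metrics`, the flat 0/1 mask representation, and the handling of empty targets are illustrative assumptions, not the repository's code. Also note that the log above appears to print fractions in [0, 1] while the paper's table is in percent, so 0.0759 would correspond to roughly 7.59, which is still far below the reported 60.0.

```python
def iou_metrics(preds, targets, eps=1e-10):
    """Return (gIoU, cIoU) for paired binary masks given as flat 0/1 sequences.

    Hypothetical helper for illustration; not the function used by train_ds.py.
    """
    per_image = []
    total_inter = 0
    total_union = 0
    for pred, target in zip(preds, targets):
        inter = sum(1 for p, t in zip(pred, target) if p and t)
        union = sum(1 for p, t in zip(pred, target) if p or t)
        # Assumption: an empty prediction on an empty target counts as IoU 1.0.
        per_image.append(inter / union if union else 1.0)
        total_inter += inter
        total_union += union
    giou = sum(per_image) / len(per_image)        # mean of per-image IoUs
    ciou = total_inter / (total_union + eps)      # cumulative intersection / union
    return giou, ciou


# Two toy 4-pixel masks: the first overlaps on 1 of 2 pixels, the second matches exactly.
preds = [[1, 1, 0, 0], [1, 0, 0, 0]]
targets = [[1, 0, 0, 0], [1, 0, 0, 0]]
giou, ciou = iou_metrics(preds, targets)
print(f"giou: {giou:.4f}, ciou: {ciou:.4f}")
```

Because cIoU pools pixels across the whole set, it is weighted toward large objects, while gIoU treats every image equally. Both values near zero, as in the log, suggest the predicted masks barely overlap the ground truth at all rather than a scaling or averaging difference.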

@ssablak changed the title from "Segmentation Performance Evaluation Result is significantly different than paper" to "Segmentation Performance Evaluation Result is significantly different than paper gIoU:60.0 cIoU:67.8 vs gIoU:0.0759 cIoU:0.0648" on Jan 10, 2024
@GuangyanS

same issue here

@GaoXiaoshan

any update?

@ccccai239

I am hitting the same issue. Has anybody fixed it?
