network meteor Transformer
A network-based PyTorch implementation for attention-head activation.
- Input
  - 3685-dim embedding
- Encoder
  - 49 x Transformer with 24 heads
- Output
  - f1 projection
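The layer list above could be wired up roughly as follows. This is only a sketch under stated assumptions, not the actual implementation: `EncoderConfig`, `head_dim`, and `build_model` are hypothetical names, and the output size of the f1 projection is not given, so `out_dim` is a placeholder. One detail worth noting is that 3685 is not divisible by 24, and PyTorch's multi-head attention requires the model width to be a multiple of the head count. The sketch therefore assumes an input projection to 3696 (24 x 154) ahead of the encoder.

```python
from dataclasses import dataclass


@dataclass
class EncoderConfig:
    input_dim: int = 3685  # from the spec: 3685-dim embedding
    num_layers: int = 49   # from the spec: 49 Transformer layers
    num_heads: int = 24    # from the spec: 24 heads
    d_model: int = 3696    # ASSUMPTION: smallest multiple of 24 that is >= 3685
    out_dim: int = 1       # ASSUMPTION: size of the "f1 projection" is not stated


def head_dim(cfg: EncoderConfig) -> int:
    """Per-head width; raises if d_model cannot be split evenly across heads."""
    if cfg.d_model % cfg.num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    return cfg.d_model // cfg.num_heads


def build_model(cfg: EncoderConfig):
    """Assemble input projection, encoder stack, and output projection.

    torch is imported lazily so the config helpers above work without it.
    """
    import torch.nn as nn

    layer = nn.TransformerEncoderLayer(
        d_model=cfg.d_model, nhead=cfg.num_heads, batch_first=True
    )
    return nn.Sequential(
        nn.Linear(cfg.input_dim, cfg.d_model),           # assumed input projection
        nn.TransformerEncoder(layer, num_layers=cfg.num_layers),
        nn.Linear(cfg.d_model, cfg.out_dim),             # stand-in for "f1 projection"
    )
```

At the full sizes this stack has billions of parameters, so a quick smoke test is better run with a reduced config, for example `build_model(EncoderConfig(input_dim=32, num_layers=2, num_heads=4, d_model=32))`.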
Training config
optimizer=SGD, lr=0.628, scheduler=linear, warmup=1999
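One possible reading of this config is sketched below: SGD with a peak learning rate of 0.628, a linear ramp from 0 over the first 1999 steps, then a linear decay to 0. The decay half and `TOTAL_STEPS` are assumptions, since the config only says `scheduler=linear` and gives no total step count. `lr_factor` and `build_optimizer` are hypothetical helper names.

```python
BASE_LR = 0.628       # from the config
WARMUP_STEPS = 1999   # from the config
TOTAL_STEPS = 20000   # ASSUMPTION: total training steps are not given


def lr_factor(step: int, warmup: int = WARMUP_STEPS, total: int = TOTAL_STEPS) -> float:
    """Multiplier applied to BASE_LR at a given optimizer step."""
    if step < warmup:
        return step / warmup                              # linear warmup from 0 to 1
    return max(0.0, (total - step) / (total - warmup))    # assumed linear decay to 0


def build_optimizer(params):
    """Wire the schedule into PyTorch; torch is imported lazily."""
    import torch

    optimizer = torch.optim.SGD(params, lr=BASE_LR)
    # LambdaLR multiplies the base lr by lr_factor(step) on each scheduler.step().
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)
    return optimizer, scheduler
```

In a training loop, `scheduler.step()` would be called once per optimizer step so that `warmup=1999` is counted in steps rather than epochs, which is itself an assumption about the unit.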