Submitted by inFamous_16 t3_11hq1ga in deeplearning

I have tweets and the task is to perform text classification. I have already learned token embeddings for the tokens in each tweet through a graph-based NN model. Now I want to use those token embeddings to represent each tweet, but the issue is that every tweet ends up with a different-sized embedding if I just concatenate. Is there any way to input variable-length embeddings to a pre-trained BioBERT (or, if not, any other BERT) model and still perform the classification task?

1

Comments


Jaffa6 t1_jav3gj2 wrote

I believe you don't really lose the context, because you also have an attention mask that basically says "don't pay attention to these tokens", and every pad token is masked out in it.
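For example, something like this (just a rough sketch with the HuggingFace tokenizer; the checkpoint name and tweets are placeholders):

```python
from transformers import AutoTokenizer

# Placeholder checkpoint -- use whichever BERT variant you're actually fine-tuning.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

tweets = ["short tweet", "a noticeably longer tweet that needs no padding"]
batch = tokenizer(tweets, padding=True, return_tensors="pt")

print(batch["input_ids"])       # both rows padded to the same length with [PAD] ids
print(batch["attention_mask"])  # 1 = real token, 0 = padding (ignored by the model)
```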

1

inFamous_16 OP t1_jav6112 wrote

Ahhh... thank you! I wasn't aware of the attention mask concept. Also, I had one more doubt: since I already have tweet features of variable size after concatenation, is there a way to skip the tokenization step? I don't actually need it, only the padding and the attention mask.

2

Jaffa6 t1_javl6ef wrote

No problem.

I believe that if you're using a BERT-esque model, you do indeed need to do "full" tokenisation (part of which is creating the attention mask and padding), because BERT expects its input to be a list of token indices. E.g. given the token mapping {"a": 1, "cow": 2, "cat": 3, "dog": 4}, tokenisation would turn "a cat" into [1, 3], which is the form that BERT expects.
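In toy code (the vocab and helper here are made up for illustration, not BERT's actual vocabulary):

```python
# Toy version of the mapping above: the model consumes token indices, not raw text.
vocab = {"a": 1, "cow": 2, "cat": 3, "dog": 4}

def toy_tokenise(text):
    return [vocab[token] for token in text.split()]

print(toy_tokenise("a cat"))  # [1, 3]
```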

And since BERT comes with a token mapping (due to pre-training), if you're just putting in your own features (say, number of likes and number of retweets), they'll quite possibly just get interpreted as random tokens if their numbers match up with known token indices.

If your features are already the right kind (tokenised text, with the resultant indices matching the correct BERT token indices), I suppose you could do truncation/padding yourself and feed that input directly to BERT.

But it'll probably end up simpler and less error-prone to let BERT's tokeniser do it for you (e.g. via HuggingFace's `AutoTokenizer.from_pretrained('bert-base-uncased')`).
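Roughly like this (a minimal sketch; the BioBERT checkpoint name and `num_labels=2` are assumptions you'd swap for your own setup):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed checkpoint -- replace with the BioBERT/BERT variant you're actually using.
checkpoint = "dmis-lab/biobert-base-cased-v1.1"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

tweets = ["first example tweet", "second, somewhat longer, example tweet"]
inputs = tokenizer(tweets, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # (batch_size, num_labels)
print(logits.argmax(dim=-1))         # predicted class per tweet
```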

2