Submitted by mostlyhydrogen t3_10rvkru in MachineLearning

Are there tools or techniques that let you run a joint query using more than one query vector?

Use case: iterative ANN search refinement, where I start with a seed vector, select matches, and re-query with more examples to improve the search results.

I tried doing this with FAISS, but it performs a "batch query" that returns a separate set of results for each query vector (not a joint query).

5

Comments

linverlan t1_j70oz53 wrote

You want to query with multiple vectors but don’t want to query with the vectors separately and don’t want to query with the mean of the vectors? You are going to need to give more details about what you want to do then.

1

BiryaniSenpai t1_j70x9mq wrote

Maybe have your vectors attend to each other and learnably output your final query vector?

1

mostlyhydrogen OP t1_j7238p8 wrote

As you probably know, ANN search often returns irrelevant data. How might I iteratively refine the search with human feedback, i.e. marking samples as "relevant" or "irrelevant" and repeating the search?

I've done a lit search and haven't found anything, maybe because I am using the wrong keywords.

1

YOLOBOT666 t1_j72cncj wrote

Iterative as in continuing until there are no more neighbours left, as you continuously add neighbours to your index and query?

1

mostlyhydrogen OP t1_j73k4xe wrote

Not exactly. I have millions of points, most of which are not related to my query vectors. I want to iteratively refine my search: search, mark results as "relevant" or "irrelevant", repeat search with updated query.
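One established name for this search, mark, re-search loop is relevance feedback, and the classic Rocchio update is one way to fold the marks back into a single query vector. The sketch below is only an illustration under assumptions that are not from this thread: embeddings are compared by inner product, the weights `alpha`, `beta`, `gamma` are arbitrary starting values, and `rocchio_update` and `top_k` are hypothetical helper names. The brute-force `top_k` stands in for whatever ANN index is actually used.

```python
import numpy as np

def rocchio_update(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward relevant examples and away from irrelevant ones.

    The weights are illustrative defaults (an assumption), not tuned values.
    """
    updated = alpha * np.asarray(query, dtype=np.float32)
    if len(relevant) > 0:
        updated = updated + beta * np.mean(relevant, axis=0)
    if len(irrelevant) > 0:
        updated = updated - gamma * np.mean(irrelevant, axis=0)
    # Re-normalise so inner-product scores stay comparable between rounds.
    norm = np.linalg.norm(updated)
    return updated / norm if norm > 0 else updated

def top_k(database, query, k):
    """Exact inner-product search; a stand-in for an ANN index lookup."""
    scores = database @ query
    return np.argsort(-scores)[:k]
```

Each round would call `top_k`, collect the annotator's marks, then call `rocchio_update` with the vectors marked relevant and irrelevant before searching again.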

1

mostlyhydrogen OP t1_j7fxwyx wrote

>ScaNN interface features

Nope. Notice that the results have shape (10000, 20) instead of (20,). That is just doing a batched query i.e. "for each of these 10k input vectors, find me 20 neighbors". What I need is a joint query, i.e. "given these 10k positive examples, give me an additional 20 candidate samples".
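To make the shape difference concrete, here is a minimal NumPy sketch with random stand-in data. The joint variant shown is just one possible definition (score every point against the centroid of the positives and exclude the positives themselves); it is an assumption for illustration, not something ScaNN or FAISS provides under that name, and `joint_query` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
database = rng.standard_normal((1000, 32)).astype(np.float32)
database /= np.linalg.norm(database, axis=1, keepdims=True)  # unit-length rows

positive_ids = np.array([3, 17, 42, 99])  # stand-in for the positive examples
k = 20

# Batched query: one independent result list per input vector.
batched_scores = database[positive_ids] @ database.T       # shape (4, 1000)
batched_ids = np.argsort(-batched_scores, axis=1)[:, :k]   # shape (4, 20)

def joint_query(database, positive_ids, k):
    """One result list for the whole positive set (centroid scoring)."""
    centroid = database[positive_ids].mean(axis=0)
    scores = database @ centroid
    scores[positive_ids] = -np.inf  # never return the positives themselves
    return np.argsort(-scores)[:k]

joint_ids = joint_query(database, positive_ids, k)         # shape (20,)
```

Under inner-product similarity, ranking by mean similarity to the positives is the same as ranking by similarity to their centroid, so this reduces to a single ordinary query. With an ANN library, that would mean passing the centroid as one query vector and filtering known positives out of the results afterwards.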

2

mostlyhydrogen OP t1_j7fydvb wrote

The goal is to harvest training data for ML. If there is a difficult edge case the model is struggling with, the best way to improve model performance is to harvest additional training data for that edge case. You stop when the model performance meets your requirements.

1

YOLOBOT666 t1_j7iov1k wrote

Nice! I guess the heuristic part is how you use the queries at every iteration and make them “usable” in your iterative approach. What’s the size and dimension of your dataset? These graph-based ANNs are memory intensive, so I'm wondering what you can do at your dimensionality.

If it’s a public repo, or you're planning to release it on GitHub, I’d be happy to join!

1

mostlyhydrogen OP t1_j7km5j2 wrote

Thanks for the offer! This is a work project, though. I'm working with images. I can't give too many details due to confidentiality, but we're at sub-billion-image scale.

Usability is determined by trained annotators. If they find an object of interest and want to harvest more training data, they do a reverse image search across the whole training data and tag true matches.

1