Summ¶
Intelligent question-answering and search for user interviews, powered by GPT-3.
How it works¶
Summ starts with a corpus of user interview transcripts. These can be in any text format, such as exports from Otter.ai.
We flow these through a pipeline: Import -> Split -> Classify | Factify | Structure | Summarize -> Embed to create a model which can answer questions across your entire dataset. Vector embeddings are persisted to Pinecone.
Finally, we enable flexible Querying, following a recursive question-answering scheme.
Check out this blog post for more details.
Requirements¶
You'll need an instance of Redis Stack running.
$ brew install --cask redis-stack/redis-stack/redis-stack-server
$ brew install yasyf/summ/redis-stack
$ brew services start yasyf/summ/redis-stack
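You can check that the server is reachable with redis-cli (assuming the default host and port):
$ redis-cli ping
PONG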
You'll also need to set three environment variables: OPENAI_API_KEY, PINECONE_API_KEY, and PINECONE_ENVIRONMENT.
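For example, in your shell (the values below are placeholders; substitute your own keys and Pinecone environment):
$ export OPENAI_API_KEY=sk-...
$ export PINECONE_API_KEY=...
$ export PINECONE_ENVIRONMENT=us-west1-gcp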
Installation¶
The easiest installation uses pip:
$ pip install summ
If you prefer to use brew:
$ brew install yasyf/summ/summ
N.B. summ requires Python 3.10+.
Demo¶
You can confirm that summ was installed properly by running the built-in example.
$ summ-example
Quickstart¶
This quickstart is taken straight from the otter.ai example.
Setup¶
First, create a new project with:
$ summ init /path/to/project
$ cd /path/to/project
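The rest of this quickstart assumes a layout along these lines (inferred from the paths referenced below; interviews/ holds your transcript exports):
/path/to/project
  implementation/
    __init__.py
    classes.py
    classifier.py
  interviews/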
Tags¶
The class MyClasses in implementation/classes.py sets out one category of tags: audio source.
from enum import StrEnum, auto

from summ.classify import Classes

class MyClasses(Classes, StrEnum):
    # SOURCE
    SOURCE_PODCAST = auto()
    SOURCE_INTERVIEW = auto()
    SOURCE_RADIO = auto()
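Because MyClasses subclasses StrEnum and uses auto(), each member's value is just its lowercase name, so every tag is a plain string:
>>> MyClasses.SOURCE_PODCAST.value
'source_podcast'
>>> MyClasses.SOURCE_INTERVIEW.value
'source_interview'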
Classifiers¶
The classifiers in implementation/classifier.py use simple parameters to define a prompt for each category of tags. It is normally sufficient to simply provide CATEGORY, VARS, and EXAMPLES. You may also optionally specify a PREFIX or SUFFIX for the prompt.
from textwrap import dedent
from typing_extensions import override
from summ.classify.classifier import Classifier, Document
from .classes import MyClasses
class TypeClassifier(Classifier, classes=MyClasses):
    CATEGORY = "SOURCE"
    VARS = {
        "opening": "Opening Remarks",
        "source": "Audio Source",
    }
    EXAMPLES = [
        {
            "opening": "Welcome to radio one!",
            "source": MyClasses.SOURCE_RADIO,
        },
        {
            "opening": "This is the latest episode of the Science podcast.",
            "source": MyClasses.SOURCE_PODCAST,
        },
        {
            "opening": "We're sitting down today with Ben.",
            "source": MyClasses.SOURCE_INTERVIEW,
        },
    ]
    SUFFIX = f"If someone is being interviewed, the class is always {MyClasses.SOURCE_INTERVIEW}, even if the medium matches a different class."

    @override
    def classify(self, docs: list[Document]) -> dict[str, str]:
        # Fill the "opening" prompt variable with the first chunk of the transcript.
        return {"opening": docs[0].page_content}
CLI¶
Finally, in implementation/__init__.py, we:
- Ensure our classifiers are imported.
- Construct a Summ object, passing a Path to our training data.
- Construct a custom Pipeline object which specifies the otter.ai import format.
- Pass these two to summ.CLI, which creates a command line interface for us.
from pathlib import Path

from summ import Pipeline, Summ
from summ.cli import CLI
from summ.splitter.otter import OtterSplitter

if __name__ == "__main__":
    summ = Summ(index="cronutt-facts")

    # Transcripts live in the project's interviews/ directory.
    path = Path(__file__).parent.parent / "interviews"

    pipe = Pipeline.default(path, summ.index)
    # Split Otter.ai exports, excluding these speakers' remarks.
    pipe.splitter = OtterSplitter(
        speakers_to_exclude=[
            "Cindy Buckmaster",
            "Michelle Greenfield",
            "Vivica",
            "Deanna",
        ]
    )

    CLI.run(summ, pipe)
Usage¶
TUI¶
To run the Terminal UI, simply do:
$ python -m implementation
You can also run the steps non-interactively, as shown below.
Populate¶
Now, to populate our model, we can do:
$ python -m implementation populate
Query¶
And to query it:
$ python -m implementation query "What kind of animal is Cronutt?"
Cronutt is a California sea lion, a species of marine mammal.