diff --git a/.gitignore b/.gitignore
index cc94bff3..ef1eca94 100644
--- a/.gitignore
+++ b/.gitignore
@@ -162,6 +162,9 @@ cython_debug/
 *.mp4
 summary.txt
 transcript.txt
+transcript_timestamps.txt
+*.html
+*.pkl
 *.ini
 test_samples/
 *.wav
\ No newline at end of file
diff --git a/42min-StartupsTechTalk-AGENDA-FULL.txt b/42min-StartupsTechTalk-AGENDA-FULL.txt
deleted file mode 100644
index 8ad3ff1c..00000000
--- a/42min-StartupsTechTalk-AGENDA-FULL.txt
+++ /dev/null
@@ -1,47 +0,0 @@
-AGENDA: Most important things to look for in a start up
-
-TAM: Make sure the market is sufficiently large than once they win they can get rewarded
-- Medium sized markets that should be winner take all can work
-- TAM needs to be realistic of direct market size
-
-Product market fit: Being in a good market with a product than can satisfy that market
-- Solves a problem
-- Builds a solution a customer wants to buy
-- Either saves the customer something (time/money/pain) or gives them something (revenue/enjoyment)
-
-Unit economics: Profit for delivering all-in cost must be attractive (% or $ amount)
-- Revenue minus direct costs
-- Raw input costs (materials, variable labour), direct cost of delivering and servicing the sale
-- Attractive as a % of sales so it can contribute to fixed overhead
-- Look for high incremental contribution margin
-
-LTV CAC: Life-time value (revenue contribution) vs cost to acquire customer must be healthy
-- LTV = Purchase value x number of purchases x customer lifespan
-- CAC = All-in costs of sales + marketing over number of new customer additions
-- Strong reputation leads to referrals leads to lower CAC. Want customers evangelizing product/service
-- Rule of thumb higher than 3
-
-Churn: Fits into LTV, low churn leads to higher LTV and helps keep future CAC down
-- Selling to replenish revenue every year is hard
-- Can run through entire customer base over time
-- Low churn builds strong net dollar retention
-
-Business: Must have sufficient barriers to entry to ward off copy-cats once established
-- High switching costs (lock-in)
-- Addictive
-- Steep learning curve once adopted (form of switching cost)
-- Two sided liquidity
-- Patents, IP, Branding
-- No hyper-scaler who can roll over you quickly
-- Scale could be a barrier to entry but works against most start-ups, not for them
-- Once developed, answer question: Could a well funded competitor starting up today easily duplicate this business or is it cheaper to buy the start up?
-
-Founders: Must be religious about their product. Believe they will change the world against all odds.
-- Just money in the bank is not enough to build a successful company. Just good tech not enough
-to build a successful company
-- Founders must be motivated to build something, not (all) about money. They would be doing
-this for free because they believe in it. Not looking for quick score
-- Founders must be persuasive. They will be asking others to sacrifice to make their dream come
-to life. They will need to convince investors this company can work and deserves funding.
-- Must understand who the customer is and what problem they are helping to solve.
-- Founders aren’t expected to know all the preceding points in this document but have an understanding of most of this, and be able to offer a vision.
\ No newline at end of file
diff --git a/README.md b/README.md
index f016e26f..4427840a 100644
--- a/README.md
+++ b/README.md
@@ -6,7 +6,7 @@ The target deliverable is a local-first live transcription and visualization too
 To setup,
-1) Check values in config.ini file. Specifically add your OPENAI_APIKEY.
+1) Check values in config.ini file. Specifically add your OPENAI_APIKEY if you plan to use OpenAI API requests.
 2) Run ``` export KMP_DUPLICATE_LIB_OK=True``` in Terminal. [This is taken care of in code, but not reflecting, Will fix this issue later.]
 3) Run the script setup_depedencies.sh.
@@ -31,9 +31,63 @@ To setup,
 ``` python3 whisjax.py "https://www.youtube.com/watch?v=ihf0S97oxuQ" --transcript transcript.txt summary.txt ```
+You can even run it on a local file or a file in your configured S3 bucket
+
+``` python3 whisjax.py "startup.mp4" --transcript transcript.txt summary.txt ```
+
+The script will take care of a few cases like youtube file, local file, video file, audio-only file,
+file in S3, etc.
+
 5) ``` pip install -r requirements.txt```
+**S3 bucket:**
+
+S3 bucket name is mentioned in config.ini. All transfers will happen between this bucket and the local computer where the
+script is run. You need AWS_ACCESS_KEY / AWS_SECRET_KEY to authenticate your calls to S3 (config.ini).
+
+For AWS S3 Web UI,
+1) Login to AWS management console.
+2) Search for S3 in the search bar at the top.
+3) Navigate to list buckets, if needed and choose your bucket (reflector-bucket)
+4) You should be able to see items in the bucket. You can upload/download here.
+
+Through CLI,
+Refer to the FILE UTIL section below.
+
+
+**FILE UTIL MODULE:**
+
+A file_util module has been created to upload/download files with AWS S3 bucket pre-configured using config.ini.
+If you need to upload / download file, separately on your own, apart from the pipeline workflow in the script,
+you can do so by :
+
+Upload:
+
+``` python3 file_util.py upload ```
+
+Download:
+
+``` python3 file_util.py download ```
+
+
+**WORKFLOW:**
+
+1) Specify the input source file from local, youtube link or upload to S3 if needed and pass it as an input to the script.
+2) Keep the agenda header topics in a local file named "agenda-headers.txt". This needs to be present where the script is run.
+3) Run the script. The script automatically creates a scatter plot of words and topics in the form of an interactive
+HTML file, a sample word cloud and uploads them to the S3 bucket
+4) Additional artefacts pushed to S3:
+ 1) HTML visualization file
+ 2) pandas df in pickle format for others to collaborate and make their own visualizations
+ 3) Summary, transcript and transcript with timestamps file in txt format.
+
+ The script also creates 2 types of mappings.
+ 1) Timestamp -> The top 2 matched agenda topic
+ 2) Topic -> All matched timestamps in the transcription
+
+Further visualizations can be planned based on available artefacts or new ones can be created.
+
 NEXT STEPS:
diff --git a/config.ini b/config.ini
index 027896ff..9d4a1d6a 100644
--- a/config.ini
+++ b/config.ini
@@ -2,9 +2,9 @@
 # Set exception rule for OpenMP error to allow duplicate lib initialization
 KMP_DUPLICATE_LIB_OK=TRUE
 # Export OpenAI API Key
-OPENAI_APIKEY=***REMOVED***
+OPENAI_APIKEY=
 # Export Whisper Model Size
 WHISPER_MODEL_SIZE=tiny
-AWS_ACCESS_KEY=
-AWS_SECRET_KEY=
+AWS_ACCESS_KEY=***REMOVED***
+AWS_SECRET_KEY=***REMOVED***
 BUCKET_NAME='reflector-bucket'
\ No newline at end of file