Full Stack Machine Learning

Full Stack Machine Learning Journal of adventures in building machine learning systems, data science, and MLOps. https://leehanchung.github.io Information Trifecta Novelty, Utility, Credibility - Choose Two - With the advent of large language models and other foundational models, the cost of generating new content is rapidly approaching zero. At one end of the spectrum, OpenAI is collaborating with Hollywood using its Sora model to create video content for studios, while companies like Suno and Udio are commoditizing... Tue, 23 Apr 2024 00:00:00 +0000 https://leehanchung.github.io/blogs/blog/2024/04/23/information-trifecta/ https://leehanchung.github.io/blogs/blog/2024/04/23/information-trifecta/ Service as Software The Model IS the Product - Forget softwares or software-as-a-service. The future belongs to Service-as-Software (SaS), where customers don’t buy tools; they hire agents. Imagining ditching travel booking website and instead use a personal travel agent AI to curate, plan, and manage your vacation. That is the power of Service-as-Software, and it’s powered by Large Language... Thu, 01 Feb 2024 00:00:00 +0000 https://leehanchung.github.io/blogs/blog/2024/02/01/service-as-software/ https://leehanchung.github.io/blogs/blog/2024/02/01/service-as-software/ Subject: search lingo Understanding Search Engines from United States vs Google - Court documents sometimes provides extremely detailed insights within tech companies. Below is the email exchange between John Giannandrea of Apple and Adrian Perica of Apple on the subject of Search, unadulterated by any LLMs. Crawl/Index A search index is essentially a copy of the web But the web is really... Wed, 03 Jan 2024 00:00:00 +0000 https://leehanchung.github.io/blogs/blog/2024/01/03/search-lingo/ https://leehanchung.github.io/blogs/blog/2024/01/03/search-lingo/ Decoding LLM Pricing: The Cost of Input and Output Sequences in Generative AI Understanding Large Language Model Inference and Its Impact on Cost - Introduction In the rapidly evolving domain of Large Language Models (LLM), a hot topic has emerged: the pricing models of LLM services. A notable trend is the cost associated with longer output sequences, a seemingly straightforward concept as more extensive outputs demand more inference time. However, a less intuitive aspect... Mon, 13 Nov 2023 00:00:00 +0000 https://leehanchung.github.io/blogs/blog/2023/11/13/llm-inference-cost/ https://leehanchung.github.io/blogs/blog/2023/11/13/llm-inference-cost/ Unlocking GPT Function Calls: A Hands-On Guide about OpenAI's Answer to Toolformer Enhance Language Models with Real-Time Data and Actions - GPT Function Call A constraint of large language models (LLMs) is that their knowledge is confined to the data they were trained on, and their functionality is limited to predicting the next token in a sequence. To address this, various initiatives have aimed to equip LLMs with tool-use capabilities, such... Mon, 04 Sep 2023 00:00:00 +0000 https://leehanchung.github.io/notebooks/notebook/2023/09/04/gpt-chat-tool-use/ https://leehanchung.github.io/notebooks/notebook/2023/09/04/gpt-chat-tool-use/ Data Leakage in WizardCoder: A LeetCode Reality Check Intentional feature leakage via optimization on the test set - Introduction Since the launch of Code Llama on August 24, 2023, there has been a surge in fine-tuning efforts within the pseudo-open-source community. Just two days after the release, Microsoft’s WizardLM Team announced that their fine-tuned WizardCoder 34B, based on Code Llama, achieved a HumanEval score of 73.2 for pass@1.... Fri, 01 Sep 2023 00:00:00 +0000 https://leehanchung.github.io/blogs/blog/2023/09/01/data-leakage-in-wizard-coder/ https://leehanchung.github.io/blogs/blog/2023/09/01/data-leakage-in-wizard-coder/ The Most Desired & Competitive Skills for MLEs in 2023 Arize:Observe 2023 - I had the opportunity to give a talk at Arize Observe 2023 on the topic of desired and competitive skils for machine learning engineers at 2023. Focus on the first principals of engineering. Wed, 26 Apr 2023 00:00:00 +0000 https://leehanchung.github.io/talks/2023/04/26/machine-learning-engineering-career/ https://leehanchung.github.io/talks/2023/04/26/machine-learning-engineering-career/ CUDA x WSL2 The Ultimate Guide to Bang Your Head Against the Wall - Introduction WSL2 provides a way to run a Linux environment on a Windows machine, making it possible to use Linux tools and applications natively within Windows. However, due to the differences between Linux and Windows, getting CUDA to work under WSL2 can be challenging. Despite the challenges, it’s important to... Wed, 29 Mar 2023 00:00:00 +0000 https://leehanchung.github.io/blogs/blog/2023/03/29/CUDA-x-WSL2/ https://leehanchung.github.io/blogs/blog/2023/03/29/CUDA-x-WSL2/ Prompt Engineering Guide GPT-4 on how to better use GPT-4 in ChatGPT - Introduction This guide aims to help users of ChatGPT better utilize the service by teaching them how to create effective prompts. By understanding various methods to instruct and steer ChatGPT, users can receive more accurate and helpful responses to complete their intended tasks. Crafting Effective Prompts Clarity and Specificity To... Fri, 17 Mar 2023 00:00:00 +0000 https://leehanchung.github.io/blogs/blog/2023/03/17/Prompt-Engineering-Guide/ https://leehanchung.github.io/blogs/blog/2023/03/17/Prompt-Engineering-Guide/ Notes from SIGIR 2022 Resource sharing and observations - KeyNotes: :movie_camera: From Search Queries to Conversations in the Design of Information Retrieval and Access Systems :movie_camera: I3A: An Intelligent Interactive Information Agent Model for Information Retrieva :movie_camera: Few-shot Information Extraction is here: Pre-train, Prompt and Entail :movie_camera: Intelligent Conversational Agents for Ambient Computing :movie_camera: Search at Bloomberg: Challenges, Opportunities... Sun, 02 Oct 2022 00:00:00 +0000 https://leehanchung.github.io/blogs/blog/2022/10/02/SIGIR-2022/ https://leehanchung.github.io/blogs/blog/2022/10/02/SIGIR-2022/ Notes from ACM RecSys 2022 Resource sharing and observations - Notable Papers: A Systematic Review and Replicability Study of BERT4Rec for Sequential Recommendation Note: fancy models do not necessary provide best results across the board. Paper: :page_facing_up: Talk: :movie_camera: Denoising Self-Attentive Sequential Recommendation Note: differentiable masking for regularization. Not sure if this is borrowed from NLP space. Paper: :page_facing_up: Talk:... Sat, 01 Oct 2022 00:00:00 +0000 https://leehanchung.github.io/blogs/blog/2022/10/01/RecSys-2022/ https://leehanchung.github.io/blogs/blog/2022/10/01/RecSys-2022/ Notes on from EMNLP 2021 Tools, Adversarial, and Biases - LMdiff: A Visual Diff Tool to Compare Language Models code demo Comment: Would be interesting to use the tool to drill on language model memorizations Notes: visualization by compares internal states of language models to see the differences of the inferenced results and how the distributino differs. LightTag: Text Annotation... Fri, 19 Nov 2021 00:00:00 +0000 https://leehanchung.github.io/blogs/blog/2021/11/19/EMNLP-2021/ https://leehanchung.github.io/blogs/blog/2021/11/19/EMNLP-2021/ AWS Lambda: Keeping Functions Warm Built-in keep-alive for Lambda functions at deployment - AWS generally terminates Lambda functions after 30-60 mins of inactivity and sometimes shorter. The long wake-up time could create service reliability issues during times when usages are low. And if the Lambda function is deployed using Docker images, the warm up time will literally be in minutes even for simple... Fri, 17 Sep 2021 00:00:00 +0000 https://leehanchung.github.io/blogs/blog/2021/09/17/keeping-aws-lambda-function-warm/ https://leehanchung.github.io/blogs/blog/2021/09/17/keeping-aws-lambda-function-warm/ AWS Sagemaker: Persist Custom Conda Environment Get your own environment for reproducibility - Sagemaker Notebook instances are expensive. Lifecycle Configuration AWS generously provided the scripts to facilitate stopping Sagemaker Notebook instance automatically using a background cron job. on start #!/bin/bash set -e # OVERVIEW # This script installs a custom, persistent installation of conda on the Notebook Instance's EBS volume, and ensures #... Sun, 15 Aug 2021 00:00:00 +0000 https://leehanchung.github.io/blogs/blog/2021/08/15/sagemaker-custom-conda/ https://leehanchung.github.io/blogs/blog/2021/08/15/sagemaker-custom-conda/ List All Columns in a Table in Redshift Useful SQL queries, part 1 - This query lists all columns in a specific table on AWS Redshift. SELECT ordinal_position AS position, column_name, data_type, CASE WHEN character_maximum_length IS NOT NULL THEN character_maximum_length ELSE numeric_precision END AS max_length, is_nullable, column_default AS default_value FROM information_schema.columns WHERE table_name = 'sample_table_name' -- enter table name here --AND table_schema = 'sample_schema_name'... Tue, 10 Aug 2021 00:00:00 +0000 https://leehanchung.github.io/blogs/blog/2021/08/10/useful-redshift-sql/ https://leehanchung.github.io/blogs/blog/2021/08/10/useful-redshift-sql/ Pandas fillna for Categorical Columns Solving ValueError: fill value must be in categories - From time to time, our favorite machine learning modeling libraries will throw errors complaining about NaNs when didnt scrub our data clean enough. The most instictive thing to try is df.fillna(inplace=True) However,pandas will throw a ValueError: fill value must be in categories for categorical columns. Pandas throws this error because... Mon, 09 Aug 2021 00:00:00 +0000 https://leehanchung.github.io/blogs/blog/2021/08/09/pandas-fillna/ https://leehanchung.github.io/blogs/blog/2021/08/09/pandas-fillna/ Syncing Conda Environments with requirements.txt Aligning Python virtual environments - There are varieties of different Python virtual environment managers out there - conda, pipenv, poetry, virtualenv, etc. And from time to time, we would need to convert or switch between different virtual environment tools. For example, AWS SageMaker deals with specifically conda while most software engineering deals with pipenv, poetry,... Wed, 04 Aug 2021 00:00:00 +0000 https://leehanchung.github.io/blogs/blog/2021/08/04/conda-requirements/ https://leehanchung.github.io/blogs/blog/2021/08/04/conda-requirements/ XGBoost Hyperparameters Tuning Research Paper Edition - Recently there’s quite a few research learning papers detailing work in attempt to unseat XGBoost from the crown of the best model for tabular data. Though none has worked well enough, the biggest contributions of these papers are the XGBoost hyperparameter ranges that they used to tune the models for... Fri, 25 Jun 2021 00:00:00 +0000 https://leehanchung.github.io/blogs/blog/2021/06/25/xgboost-hyperparameter-tuning/ https://leehanchung.github.io/blogs/blog/2021/06/25/xgboost-hyperparameter-tuning/ SageMaker: Automatically Stop Notebook and Studio Instances When Idle Terminate SageMaker instances before it terminates your budget - SageMaker Notebook instances and SageMaker Studios are expensive. Insanely expensive especially if you leave them unterminated. Fortunately, there are ways to set up auto-shutdown of both SageMaker Notebook and SageMaker Studio instances when they are idling. SageMaker Notebook Instance AWS generously provided the scripts to facilitate stopping SageMaker Notebook instance... Mon, 14 Jun 2021 00:00:00 +0000 https://leehanchung.github.io/blogs/blog/2021/06/14/sagemaker-auto-terminate/ https://leehanchung.github.io/blogs/blog/2021/06/14/sagemaker-auto-terminate/ Getting Months from a Date Range in Pandas Use period_range to make Pandas inclusive - Those who work with time series or time dependent data sometimes need to get the months between two dates. Naturally we would incline to use pandas.date_range. However, pandas date_range does not enclude the end month. In the following example, the end month 2021-06 is not included Pandas’s output, with closed... Fri, 04 Jun 2021 00:00:00 +0000 https://leehanchung.github.io/blogs/blog/2021/06/04/pandas-date-ranges/ https://leehanchung.github.io/blogs/blog/2021/06/04/pandas-date-ranges/