Document loaders
If you'd like to write your own document loader, see this how-to. If you'd like to contribute an integration, see Contributing integrations.
Features
The following table shows the feature support for all document loaders.
| Document Loader | Description | Lazy loading | Native async support | 
|---|---|---|---|
| AZLyricsLoader | Load AZLyrics webpages. | ✅ | ✅ |
| AcreomLoader | Load acreom vault from a directory. | ✅ | ❌ |
| AirtableLoader | Load the Airtable tables. | ✅ | ❌ |
| AmazonTextractPDFLoader | Load PDF files from a local file system, HTTP or S3. | ✅ | ❌ |
| ApifyDatasetLoader | Load datasets from Apify web scraping, crawling, and data extraction platform. | ❌ | ❌ |
| ArcGISLoader | Load records from an ArcGIS FeatureLayer. | ✅ | ❌ | 
| ArxivLoader | Load a query result from Arxiv. | ✅ | ❌ | 
| AssemblyAIAudioLoaderById | | ✅ | ❌ |
| AssemblyAIAudioTranscriptLoader | Load AssemblyAI audio transcripts. | ✅ | ❌ | 
| AstraDBLoader | Deprecated since 0.0.29. Use langchain_astradb.AstraDBLoader instead. | ✅ | ✅ |
| AsyncChromiumLoader | Scrape HTML pages from URLs using a | ✅ | ✅ | 
| AsyncHtmlLoader | Load HTML asynchronously. | ✅ | ✅ |
| AthenaLoader | Load documents from AWS Athena. | ✅ | ❌ | 
| AzureAIDataLoader | Load from Azure AI Data. | ✅ | ❌ | 
| AzureAIDocumentIntelligenceLoader | Load a PDF with Azure Document Intelligence. | ✅ | ❌ | 
| AzureBlobStorageContainerLoader | Load from Azure Blob Storage container. | ❌ | ❌ |
| AzureBlobStorageFileLoader | Load from Azure Blob Storage files. | ❌ | ❌ |
| BSHTMLLoader | Load HTML files and parse them with Beautiful Soup. | ✅ | ❌ |
| BibtexLoader | Load a bibtex file. | ✅ | ❌ |
| BigQueryLoader | Deprecated since 0.0.32. Use langchain_google_community.BigQueryLoader instead. | ❌ | ❌ |
| BiliBiliLoader | | ❌ | ❌ |
| BlackboardLoader | Load a Blackboard course. | ✅ | ✅ |
| BlockchainDocumentLoader | Load elements from a blockchain smart contract. | ❌ | ❌ | 
| BraveSearchLoader | Load with Brave Search engine. | ✅ | ❌ |
| BrowserbaseLoader | Load pre-rendered web pages using a headless browser hosted on Browserbase. | ✅ | ❌ | 
| BrowserlessLoader | Load webpages with the Browserless /content endpoint. | ✅ | ❌ |
| CSVLoader | Load a CSV file into a list of Documents. | ✅ | ❌ |
| CassandraLoader | | ✅ | ✅ |
| ChatGPTLoader | Load conversations from exported ChatGPT data. | ❌ | ❌ |
| CoNLLULoader | Load CoNLL-U files. | ❌ | ❌ |
| CollegeConfidentialLoader | Load College Confidential webpages. | ✅ | ✅ |
| ConcurrentLoader | Load and parse Documents concurrently. | ✅ | ❌ |
| ConfluenceLoader | Load Confluence pages. | ✅ | ❌ |
| CouchbaseLoader | Load documents from Couchbase. | ✅ | ❌ | 
| CubeSemanticLoader | Load Cube semantic layer metadata. | ✅ | ❌ |
| DataFrameLoader | Load Pandas DataFrame. | ✅ | ❌ |
| DatadogLogsLoader | Load Datadog logs. | ❌ | ❌ |
| DedocAPIFileLoader | | ✅ | ❌ |
| DedocFileLoader | | ✅ | ❌ |
| DedocPDFLoader | | ✅ | ❌ |
| DiffbotLoader | Load Diffbot JSON file. | ❌ | ❌ |
| DirectoryLoader | Load from a directory. | ✅ | ❌ | 
| DiscordChatLoader | Load Discord chat logs. | ❌ | ❌ |
| DocugamiLoader | Deprecated since 0.0.24. Use docugami_langchain.DocugamiLoader instead. | ❌ | ❌ |
| DocusaurusLoader | Load from Docusaurus Documentation. | ✅ | ✅ | 
| Docx2txtLoader | Load a DOCX file using docx2txt and chunk it at character level. | ❌ | ❌ |
| DropboxLoader | Load files from Dropbox. | ❌ | ❌ | 
| DuckDBLoader | Load from DuckDB. | ❌ | ❌ | 
| EtherscanLoader | Load transactions from Ethereum mainnet. | ✅ | ❌ |
| EverNoteLoader | Load from EverNote. | ✅ | ❌ | 
| FacebookChatLoader | Load Facebook Chat messages directory dump. | ✅ | ❌ |
| FaunaLoader | Load from FaunaDB. | ✅ | ❌ | 
| FigmaFileLoader | Load Figma file. | ❌ | ❌ |
| FireCrawlLoader | Load web pages as Documents using FireCrawl. | ✅ | ❌ | 
| GCSDirectoryLoader | Deprecated since 0.0.32. Use langchain_google_community.GCSDirectoryLoader instead. | ❌ | ❌ |
| GCSFileLoader | Deprecated since 0.0.32. Use langchain_google_community.GCSFileLoader instead. | ❌ | ❌ |
| GeoDataFrameLoader | Load geopandas Dataframe. | ✅ | ❌ |
| GitHubIssuesLoader | Load issues of a GitHub repository. | ✅ | ❌ | 
| GitLoader | Load Git repository files. | ✅ | ❌ |
| GitbookLoader | Load GitBook data. | ✅ | ✅ |
| GithubFileLoader | Load GitHub File | ✅ | ❌ | 
| GlueCatalogLoader | Load table schemas from AWS Glue. | ✅ | ❌ | 
| GoogleApiYoutubeLoader | Load all videos from a YouTube channel. | ❌ | ❌ |
| GoogleDriveLoader | Deprecated since 0.0.32. Use langchain_google_community.GoogleDriveLoader instead. | ❌ | ❌ |
| GoogleSpeechToTextLoader | Deprecated since 0.0.32. Use langchain_google_community.SpeechToTextLoader instead. | ❌ | ❌ |
| GutenbergLoader | Load from Gutenberg.org. | ❌ | ❌ | 
| HNLoader | Load Hacker News data. | ✅ | ✅ |
| HuggingFaceDatasetLoader | Load from Hugging Face Hub datasets. | ✅ | ❌ |
| HuggingFaceModelLoader | | ✅ | ❌ |
| IFixitLoader | Load iFixit repair guides, device wikis and answers. | ❌ | ❌ |
| IMSDbLoader | Load IMSDb webpages. | ✅ | ✅ |
| ImageCaptionLoader | Load image captions. | ❌ | ❌ | 
| IuguLoader | Load from IUGU. | ❌ | ❌ | 
| JSONLoader | | ✅ | ❌ |
| JoplinLoader | Load notes from Joplin. | ✅ | ❌ | 
| KineticaLoader | Load from Kinetica API. | ✅ | ❌ |
| LLMSherpaFileLoader | Load Documents using LLMSherpa. | ✅ | ❌ | 
| LakeFSLoader | Load from lakeFS. | ❌ | ❌ | 
| LarkSuiteDocLoader | Load from LarkSuite (FeiShu). | ✅ | ❌ |
| MHTMLLoader | Parse MHTML files with BeautifulSoup. | ✅ | ❌ |
| MWDumpLoader | Load MediaWiki dump from an XML file. | ✅ | ❌ |
| MastodonTootsLoader | Load the Mastodon 'toots'. | ✅ | ❌ |
| MathpixPDFLoader | Load PDF files using Mathpix service. | ❌ | ❌ |
| MaxComputeLoader | Load from Alibaba Cloud MaxCompute table. | ✅ | ❌ |
| MergedDataLoader | Merge documents from a list of loaders | ✅ | ✅ | 
| ModernTreasuryLoader | Load from Modern Treasury. | ❌ | ❌ | 
| MongodbLoader | Load MongoDB documents. | ❌ | ✅ | 
| NewsURLLoader | Load news articles from URLs using Unstructured. | ✅ | ❌ | 
| NotebookLoader | Load Jupyter notebook (.ipynb) files. | ❌ | ❌ |
| NotionDBLoader | Load from Notion DB. | ❌ | ❌ | 
| NotionDirectoryLoader | Load Notion directory dump. | ❌ | ❌ |
| OBSDirectoryLoader | Load from Huawei OBS directory. | ❌ | ❌ | 
| OBSFileLoader | Load from the Huawei OBS file. | ❌ | ❌ | 
| ObsidianLoader | Load Obsidian files from directory. | ✅ | ❌ |
| OneDriveFileLoader | Load a file from Microsoft OneDrive. | ❌ | ❌ | 
| OneDriveLoader | Load from Microsoft OneDrive. | ✅ | ❌ | 
| OnlinePDFLoader | Load online PDF. | ❌ | ❌ | 
| OpenCityDataLoader | Load from Open City. | ✅ | ❌ | 
| OracleAutonomousDatabaseLoader | | ❌ | ❌ |
| OracleDocLoader | Read documents using OracleDocLoader | ❌ | ❌ | 
| OutlookMessageLoader | | ✅ | ❌ |
| PDFMinerLoader | Load PDF files using PDFMiner. | ✅ | ❌ |
| PDFMinerPDFasHTMLLoader | Load PDF files as HTML content using PDFMiner. | ✅ | ❌ |
| PDFPlumberLoader | Load PDF files using pdfplumber. | ❌ | ❌ |
| PagedPDFSplitter | Load PDF using pypdf into list of documents. | ✅ | ❌ | 
| PebbloSafeLoader | Pebblo Safe Loader class is a wrapper around document loaders enabling the data | ✅ | ❌ | 
| PlaywrightURLLoader | Load HTML pages with Playwright and parse with Unstructured. | ✅ | ✅ |
| PolarsDataFrameLoader | Load Polars DataFrame. | ✅ | ❌ |
| PsychicLoader | Load from Psychic.dev. | ✅ | ❌ | 
| PubMedLoader | Load from the PubMed biomedical library. | ✅ | ❌ |
| PyMuPDFLoader | Load PDF files using PyMuPDF. | ✅ | ❌ |
| PyPDFDirectoryLoader | Load a directory of PDF files using pypdf and chunk them at character level. | ❌ | ❌ |
| PyPDFLoader | Load PDF using pypdf into list of documents. | ✅ | ❌ | 
| PyPDFium2Loader | Load PDF using pypdfium2 and chunk it at character level. | ✅ | ❌ |
| PySparkDataFrameLoader | Load PySpark DataFrames. | ✅ | ❌ |
| PythonLoader | Load Python files, respecting any non-default encoding if specified. | ✅ | ❌ |
| RSSFeedLoader | Load news articles from RSS feeds using Unstructured. | ✅ | ❌ |
| ReadTheDocsLoader | Load ReadTheDocs documentation directory. | ✅ | ❌ |
| RecursiveUrlLoader | Recursively load all child links from a root URL. | ✅ | ❌ | 
| RedditPostsLoader | Load Reddit posts. | ❌ | ❌ |
| RoamLoader | Load Roam files from a directory. | ❌ | ❌ |
| RocksetLoader | Load from a Rockset database. | ✅ | ❌ |
| S3DirectoryLoader | Load from Amazon AWS S3 directory. | ❌ | ❌ |
| S3FileLoader | Load from Amazon AWS S3 file. | ✅ | ❌ |
| SQLDatabaseLoader | | ✅ | ❌ |
| SRTLoader | Load .srt (subtitle) files. | ❌ | ❌ |
| ScrapflyLoader | Turn a URL into LLM-accessible markdown with Scrapfly.io. | ✅ | ❌ |
| ScrapingAntLoader | Turn a URL into LLM-accessible markdown with ScrapingAnt. | ✅ | ❌ |
| SeleniumURLLoader | Load HTML pages with Selenium and parse with Unstructured. | ❌ | ❌ |
| SharePointLoader | Load from SharePoint. | ✅ | ❌ |
| SitemapLoader | Load a sitemap and its URLs. | ✅ | ✅ | 
| SlackDirectoryLoader | Load from a Slack directory dump. | ✅ | ❌ |
| SnowflakeLoader | Load from Snowflake API. | ✅ | ❌ |
| SpiderLoader | Load web pages as Documents using Spider AI. | ✅ | ❌ | 
| SpreedlyLoader | Load from Spreedly API. | ❌ | ❌ |
| StripeLoader | Load from Stripe API. | ❌ | ❌ |
| SurrealDBLoader | Load SurrealDB documents. | ❌ | ✅ | 
| TelegramChatApiLoader | Load Telegram chat JSON directory dump. | ❌ | ❌ |
| TelegramChatFileLoader | Load from Telegram chat dump. | ❌ | ❌ |
| TelegramChatLoader | Load from Telegram chat dump. | ❌ | ❌ |
| TencentCOSDirectoryLoader | Load from Tencent Cloud COS directory. | ✅ | ❌ |
| TencentCOSFileLoader | Load from Tencent Cloud COS file. | ✅ | ❌ |
| TensorflowDatasetLoader | Load from TensorFlow Dataset. | ✅ | ❌ | 
| TextLoader | Load text file. | ✅ | ❌ | 
| TiDBLoader | Load documents from TiDB. | ✅ | ❌ | 
| ToMarkdownLoader | Load HTML using 2markdown API. | ✅ | ❌ |
| TomlLoader | Load TOML files. | ✅ | ❌ |
| TrelloLoader | Load cards from a Trello board. | ✅ | ❌ |
| TwitterTweetLoader | Load Twitter tweets. | ❌ | ❌ |
| UnstructuredAPIFileIOLoader | Deprecated since 0.2.8. Use langchain_unstructured.UnstructuredLoader instead. | ✅ | ❌ |
| UnstructuredAPIFileLoader | Deprecated since 0.2.8. Use langchain_unstructured.UnstructuredLoader instead. | ✅ | ❌ |
| UnstructuredCHMLoader | Load CHM files using Unstructured. | ✅ | ❌ |
| UnstructuredCSVLoader | Load CSV files using Unstructured. | ✅ | ❌ |
| UnstructuredEPubLoader | Load EPub files using Unstructured. | ✅ | ❌ |
| UnstructuredEmailLoader | Load email files using Unstructured. | ✅ | ❌ | 
| UnstructuredExcelLoader | Load Microsoft Excel files using Unstructured. | ✅ | ❌ | 
| UnstructuredFileIOLoader | Deprecated since 0.2.8. Use langchain_unstructured.UnstructuredLoader instead. | ✅ | ❌ |
| UnstructuredFileLoader | Deprecated since 0.2.8. Use langchain_unstructured.UnstructuredLoader instead. | ✅ | ❌ |
| UnstructuredHTMLLoader | Load HTML files using Unstructured. | ✅ | ❌ |
| UnstructuredImageLoader | Load PNG and JPG files using Unstructured. | ✅ | ❌ |
| UnstructuredMarkdownLoader | Load Markdown files using Unstructured. | ✅ | ❌ |
| UnstructuredODTLoader | Load OpenOffice ODT files using Unstructured. | ✅ | ❌ |
| UnstructuredOrgModeLoader | Load Org-Mode files using Unstructured. | ✅ | ❌ |
| UnstructuredPDFLoader | Load PDF files using Unstructured. | ✅ | ❌ |
| UnstructuredPowerPointLoader | Load Microsoft PowerPoint files using Unstructured. | ✅ | ❌ |
| UnstructuredRSTLoader | Load RST files using Unstructured. | ✅ | ❌ |
| UnstructuredRTFLoader | Load RTF files using Unstructured. | ✅ | ❌ |
| UnstructuredTSVLoader | Load TSV files using Unstructured. | ✅ | ❌ |
| UnstructuredURLLoader | Load files from remote URLs using Unstructured. | ❌ | ❌ | 
| UnstructuredWordDocumentLoader | Load Microsoft Word file using Unstructured. | ✅ | ❌ |
| UnstructuredXMLLoader | Load XML file using Unstructured. | ✅ | ❌ |
| VsdxLoader | | ❌ | ❌ |
| WeatherDataLoader | Load weather data with Open Weather Map API. | ✅ | ❌ |
| WebBaseLoader | Load HTML pages using urllib and parse them with BeautifulSoup. | ✅ | ✅ |
| WhatsAppChatLoader | Load WhatsApp messages text file. | ✅ | ❌ |
| WikipediaLoader | Load from Wikipedia. | ✅ | ❌ | 
| XorbitsLoader | Load Xorbits DataFrame. | ✅ | ❌ |
| YoutubeLoader | Load YouTube video transcripts. | ❌ | ❌ |
| YuqueLoader | Load documents from Yuque. | ❌ | ❌ |
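
The two feature columns in the table above can be read roughly as follows: a loader with lazy loading yields documents one at a time instead of building the whole list in memory, and a loader with native async support exposes an asynchronous iterator over the same documents. The sketch below illustrates the distinction using only the standard library. `Doc` and `LineLoader` are hypothetical stand-ins written for this example, not LangChain classes; the method names `lazy_load`, `load`, and `alazy_load` mirror those on LangChain's `BaseLoader`, but the bodies here are assumptions and not the library's implementation.

```python
import asyncio
from dataclasses import dataclass, field
from typing import AsyncIterator, Iterator


@dataclass
class Doc:
    """Hypothetical stand-in for a document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class LineLoader:
    """Hypothetical loader that turns each line of a string into a Doc."""

    def __init__(self, text: str) -> None:
        self.text = text

    def lazy_load(self) -> Iterator[Doc]:
        # Lazy loading: a generator that yields one document at a time.
        for number, line in enumerate(self.text.splitlines(), start=1):
            yield Doc(page_content=line, metadata={"line": number})

    def load(self) -> list[Doc]:
        # Eager loading: materialize everything the lazy iterator produces.
        return list(self.lazy_load())

    async def alazy_load(self) -> AsyncIterator[Doc]:
        # Async iteration: a natively async loader would await real I/O here.
        for doc in self.lazy_load():
            await asyncio.sleep(0)  # placeholder for an awaited I/O call
            yield doc


async def collect(loader: LineLoader) -> list[Doc]:
    # Consume the async iterator with an async comprehension.
    return [doc async for doc in loader.alazy_load()]


loader = LineLoader("alpha\nbeta\ngamma")
first = next(loader.lazy_load())           # only the first line is processed
all_docs = loader.load()                   # all three lines are processed
async_docs = asyncio.run(collect(loader))  # same documents via the async path
```

With a real loader the calling pattern would be similar: iterate over `lazy_load()` when the source is large, and use the async iterator inside an event loop when the loader lists native async support.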