Read PDF to Text using Google Apps Script Drive V3 Service

Simplifying Google Apps Script to Convert PDF to Text

Note: the text recognition is done by Google Drive’s OCR, and the quality of the results varies with the PDF content, for example embedded images or scanned pages.

You need to enable the Drive advanced service, version 3 (v3):

https://developers.google.com/apps-script/advanced/drive

In this blog post, we’ll explore how to use Google Apps Script to convert a PDF into a text file using Optical Character Recognition (OCR) and Google Drive’s capabilities. The script is a practical way to extract text from PDFs stored in Google Drive so that the text can be used for tasks like data processing or content migration.

Overview of the Solution

The script leverages Google Drive’s OCR feature to convert a PDF into a Google Document, from which text can then be extracted and saved as a text file. This process involves:

  1. Converting the PDF to a Google Document using OCR.
  2. Extracting text from the created Google Document.
  3. Saving the extracted text as a plain text file in Google Drive.
  4. Cleaning up by moving the temporary Google Document to the trash.

Prerequisites

Before implementing the script, make sure the Drive advanced service (Drive API v3) is enabled in your Apps Script project:

  • Open your Apps Script project.
  • In the editor’s left sidebar, click the + next to Services, select Drive API, choose version v3, and click Add.
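
If the service is missing, or was added as v2, the script fails with a fairly opaque error. A minimal sketch of a guard that could be called at the top of the script is shown here. The helper name assertDriveV3Enabled is hypothetical, and the check assumes that v3 exposes Drive.Files.create as a function while v2 exposes Drive.Files.insert instead.

```javascript
// Sketch only: fail fast with a readable message when the Drive advanced
// service is missing or is not v3. `assertDriveV3Enabled` is a made-up name.
function assertDriveV3Enabled() {
  // `typeof` is safe even when the identifier was never declared.
  if (typeof Drive === 'undefined') {
    throw new Error('Drive advanced service is not enabled. Add it under Services.');
  }
  // Assumption: v3 exposes Files.create, whereas v2 used Files.insert.
  if (!Drive.Files || typeof Drive.Files.create !== 'function') {
    throw new Error('Drive advanced service is enabled, but not as v3.');
  }
}
```

Calling assertDriveV3Enabled() as the first line of convertPDFToText() would surface a configuration problem before any file is touched.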

The Simplified Script

Below is a streamlined version of the script, focused on clarity and brevity:

function convertPDFToText() {
  const fileId = 'YOUR_PDF_FILE_ID_HERE'; // Replace with your PDF file ID
  const ocrLanguage = 'en'; // OCR language hint (English)

  // Convert PDF to Google Doc using OCR.
  // In Drive API v3, requesting a Google Docs MIME type triggers the conversion;
  // the `ocr` flag belonged to v2 and is not a v3 parameter, so it is omitted.
  const pdfBlob = DriveApp.getFileById(fileId).getBlob();
  const doc = Drive.Files.create({
    name: pdfBlob.getName().replace(/\.pdf$/i, ''), // strip .pdf or .PDF
    mimeType: MimeType.GOOGLE_DOCS
  }, pdfBlob, {
    ocrLanguage: ocrLanguage
  });

  // Extract text and save as a text file
  const text = DocumentApp.openById(doc.id).getBody().getText();
  DriveApp.createFile(doc.name + '.txt', text, MimeType.PLAIN_TEXT);

  // Move the temporary Google Doc to the trash
  DriveApp.getFileById(doc.id).setTrashed(true);

  return text;
}
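
The fileId placeholder is the ID that appears in the file’s Drive link. As a convenience, here is a small sketch of a helper that pulls the ID out of the two common link shapes (.../file/d/ID/view and .../open?id=ID). The function name extractDriveFileId is hypothetical, and the IDs in the usage lines are made up.

```javascript
// Sketch only: extract a Drive file ID from a sharing link.
// `extractDriveFileId` is a hypothetical helper, not part of any Google API.
function extractDriveFileId(urlOrId) {
  const patterns = [
    /\/d\/([A-Za-z0-9_-]+)/,   // https://drive.google.com/file/d/<ID>/view
    /[?&]id=([A-Za-z0-9_-]+)/  // https://drive.google.com/open?id=<ID>
  ];
  for (const pattern of patterns) {
    const match = urlOrId.match(pattern);
    if (match) return match[1];
  }
  // No link shape matched: assume the caller already passed a bare ID.
  return urlOrId;
}

// Usage with made-up IDs:
const idFromLink = extractDriveFileId('https://drive.google.com/file/d/abc123_-XYZ/view?usp=sharing');
const idFromOpen = extractDriveFileId('https://drive.google.com/open?id=abc123');
```

The result can then be assigned to fileId instead of pasting the ID by hand.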

Explanation of the Script

  1. Convert PDF to Google Doc: The script starts by obtaining the blob of the PDF file using its file ID. It then creates a new Google Document from this blob using OCR. The document name is derived by removing the .pdf extension from the original file name.
  2. Extract Text: Once the Google Document is created, the script opens this document and extracts its text content.
  3. Save Text as a File: The extracted text is saved into a new text file in Google Drive. The file name is set as the original document name with a .txt extension added.
  4. Clean-Up: After the text extraction, the temporary Google Document is moved to the trash (setTrashed(true) does not delete it permanently) so that it does not clutter your Google Drive.
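
The naming rule from steps 1 and 3 can be isolated into a small pure function, which makes edge cases such as an upper-case extension easy to check. This is only a sketch; the name toTextFileName is hypothetical and the file names in the usage lines are invented.

```javascript
// Sketch only: derive the output .txt name from the PDF's file name.
// `toTextFileName` is a hypothetical helper.
function toTextFileName(pdfName) {
  // Strip a trailing ".pdf" in any letter case, then append ".txt".
  const baseName = pdfName.replace(/\.pdf$/i, '');
  return baseName + '.txt';
}

// Usage with invented names:
const lowerCaseName = toTextFileName('Report.pdf');
const upperCaseName = toTextFileName('SCAN.PDF');
const noExtension = toTextFileName('notes');
```

A name without a .pdf suffix is left intact and simply gains the .txt extension.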

Benefits and Limitations

  • Benefits: The script is easy to set up and run, integrates seamlessly with Google Drive, and doesn’t require any external libraries or APIs apart from Google’s own services.
  • Limitations: The OCR’s accuracy depends on the quality of the PDF. It works best with text-based PDFs and might struggle with scanned documents or images.
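
Because OCR quality varies, it can help to sanity-check the extracted text before using it downstream. The sketch below is one rough heuristic, not a measure of OCR accuracy: it rejects output that is very short or consists mostly of symbols. The function name looksLikeUsableText and both thresholds are assumptions chosen for illustration.

```javascript
// Sketch only: a rough check that OCR output contains real text.
// `looksLikeUsableText` and its default thresholds are hypothetical.
function looksLikeUsableText(text, minLength = 20, minWordCharRatio = 0.5) {
  const trimmed = text.trim();
  if (trimmed.length < minLength) return false;

  // Count letters and digits in any script (Unicode property escapes).
  const wordChars = (trimmed.match(/[\p{L}\p{N}]/gu) || []).length;
  // Compare against all non-whitespace characters.
  const nonSpace = trimmed.replace(/\s/g, '').length;

  return wordChars / nonSpace >= minWordCharRatio;
}
```

For example, the return value of convertPDFToText() could be passed to looksLikeUsableText() and the file flagged for manual review when the check fails.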

This simplified script provides a quick and efficient way to convert PDF files into editable text within the Google Drive ecosystem, making it a valuable tool for many applications.

Longer Version

The same logic, split into helper functions and wrapped in error handling:

function convertPDFToText() {
  const fileId = '1Dr7****76'; // Sample PDF file
  const ocrLanguage = 'en'; // OCR language set to English

  try {
    // Fetch the PDF file from Google Drive
    const pdfFile = DriveApp.getFileById(fileId);
    const pdfName = pdfFile.getName();
    const pdfBlob = pdfFile.getBlob();

    // Convert PDF to a Google Document using OCR
    const googleDoc = createGoogleDocFromPDF(pdfName, pdfBlob, ocrLanguage);

    // Extract text content from the Google Document
    const textContent = extractTextFromGoogleDoc(googleDoc.id);

    // Optional: Save the extracted text as a text file in Google Drive
    saveTextToFile(`${googleDoc.name}.txt`, textContent);

    // Clean up by deleting the temporary Google Document
    DriveApp.getFileById(googleDoc.id).setTrashed(true);

    return textContent;
  } catch (error) {
    Logger.log('Error converting PDF to text: ' + error.toString());
    return null;
  }
}

function createGoogleDocFromPDF(pdfName, pdfBlob, ocrLanguage) {
  // Strip a trailing .pdf (any letter case) to get the document name
  const docName = pdfName.replace(/\.pdf$/i, '');
  const resource = {
    name: docName,
    mimeType: MimeType.GOOGLE_DOCS // Requesting a Google Doc triggers the OCR conversion
  };
  // Drive API v3 accepts ocrLanguage as a hint; the `ocr` flag was a v2 parameter
  const options = {
    ocrLanguage: ocrLanguage,
    fields: 'id,name'
  };

  // Create a Google Document from the PDF blob using OCR
  const { id, name } = Drive.Files.create(resource, pdfBlob, options);
  return { id, name };
}

function extractTextFromGoogleDoc(docId) {
  // Open the Google Document and extract the text
  const googleDoc = DocumentApp.openById(docId);
  return googleDoc.getBody().getText();
}

function saveTextToFile(fileName, textContent) {
  // Create a new text file in Google Drive with the extracted text content
  DriveApp.createFile(fileName, textContent, MimeType.PLAIN_TEXT);
}