AWS CloudSearch: content of uploaded PDFs is not indexed

I am attempting to upload a PDF to CloudSearch via the console. The document is added, but its content is not effectively searchable. The console generates SDF-formatted JSON like this:

[ {
  "type" : "add",
  "id" : "Sample.pdf",
  "fields" : {
    "content_type" : "text/plain",
    "content_encoding" : "windows-1252",
    "resourcename" : "Sample.pdf",
    "content" : "%PDF-1.6\r\nCatalogx^½]ÛrÜ6�}Ÿ¯˜­ÊÃ{...}\r\n%%EOF"
  }
} ]

When I search for document content, the readable text above ("PDF", "Catalog") matches, but none of the "useful" content of the document does.
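A likely explanation, offered as an assumption rather than something the console output confirms: the page text inside a PDF is usually stored in compressed streams (commonly Flate/zlib), so indexing the raw bytes as text exposes only structural keywords. The `x^` right after `Catalog` in the output above is consistent with a zlib header. The sketch below uses Python's standard `zlib` module on a made-up string to show that words do not survive compression in searchable form. The sample text is hypothetical and no real PDF is parsed.

```python
import zlib

# Hypothetical page text; a real PDF would hold this inside a content stream.
page_text = "Quarterly invoice total: 1234.56 USD\n" * 20

# PDF producers commonly compress streams with Flate (zlib/deflate).
compressed = zlib.compress(page_text.encode("ascii"))

# View the compressed bytes the way the console output labels them:
# as windows-1252 "text". Undefined bytes are replaced with U+FFFD.
raw_view = compressed.decode("windows-1252", errors="replace")

print("invoice" in page_text)  # the word exists in the real text
print("invoice" in raw_view)   # membership test against the raw byte view
print(zlib.decompress(compressed).decode("ascii") == page_text)
```

The text only becomes searchable again after the stream is decompressed and decoded, which is the job of a PDF text extractor.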

I was surprised to see that:

  • the content_type was text/plain instead of application/pdf, and
  • the content was not encoded as something like base64

I then hand-crafted my own batch XML file to attempt the same:

<batch>
    <add id="pdftest1">
        <field name="content_type">application/pdf</field>
        <field name="resourcename">Sample1.pdf</field>
        <field name="content">{copied from aws console output}</field>
    </add>
</batch>

and

<batch>
    <add id="pdftest2">
        <field name="content_type">application/pdf</field>
        <field name="resourcename">Sample2.pdf</field>
        <field name="content">{base64 encoded pdf contents}</field>
    </add>
</batch>

Is it possible to have CloudSearch search the "useful" contents of a PDF without converting the PDF to a text file first?

If so, what am I doing wrong?

Edit 6/27/2016

The CloudSearch command line interface generates batches that work by converting the PDF to raw text. Not sure why the AWS CloudSearch console does not do the same.

C:\Downloads>cs-import-documents --source .\Sample.pdf --output .\1.json

produced:

[ {
  "type" : "add",
  "id" : "xmlC:_Downloads_Sample.pdf",
  "fields" : {
    "content_type" : "application/pdf",
    "created" : "Fri Jun 17 11:14:45 EDT 2016",
    "resourcename" : "Sample.pdf",
    "content" : "6/17/2016 [... remaining text omitted for brevity ...]"
  }
} ]
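If the text has already been extracted by some other tool, a batch of the same shape can be assembled by hand. The following is a minimal sketch using only Python's standard library. The function names, the sample ID, and the ID clean-up rule are illustrative assumptions, not part of any CloudSearch tooling, and the field names simply mirror the CLI output above.

```python
import json
import re


def make_add_operation(doc_id, resource_name, text):
    """Build one SDF "add" operation shaped like the CLI output above."""
    # Conservative ID clean-up (an assumption, not a documented rule):
    # keep letters, digits, underscore and hyphen; replace everything else.
    safe_id = re.sub(r"[^A-Za-z0-9_-]", "_", doc_id)
    return {
        "type": "add",
        "id": safe_id,
        "fields": {
            "content_type": "application/pdf",
            "resourcename": resource_name,
            "content": text,
        },
    }


def make_batch(operations):
    """Serialize a list of operations as a JSON batch (a JSON array)."""
    return json.dumps(operations, ensure_ascii=False, indent=2)


batch = make_batch([
    make_add_operation("docs/Sample.pdf", "Sample.pdf", "Text extracted from the PDF"),
])
print(batch)
```

The fields used here must exist in the domain's index configuration for the upload to be accepted.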

The AWS documentation includes:

Amazon CloudSearch console provide a way to automatically generate properly formatted JSON or XML from several common file types: PDF, Microsoft Excel, Microsoft PowerPoint, Microsoft Word, CSV, text, and HTML.

This appears to be incorrect as of 6/24/2016 (or I've missed something in my usage of the console).

This leaves me with an alternate question: what is a reasonably efficient way to load several hundred new PDFs per day from an S3 bucket into CloudSearch? Specifically:

  • Does the CloudSearch API offer "pdf-to-text" conversion?
  • Must I use the CS CLI to perform the conversion?

If the CLI is the recommended way to go, that seems inefficient in that (I assume) the CLI must pull the PDF from S3, convert to text, and then push the resulting SDF to CloudSearch. It seems ... odd that AWS would not provide an API call against CS that would do precisely this for me. Perhaps they do offer it and I'm missing it?
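Whichever tool performs the conversion, several hundred documents a day will probably need to be grouped into multiple upload batches. Below is a rough sketch of that grouping step in standard-library Python. The function name is hypothetical, and the 5 MB default reflects the batch size limit as I understand it, so treat it as an assumption and check the current service limits.

```python
import json


def split_into_batches(operations, max_batch_bytes=5 * 1024 * 1024):
    """Group SDF operations into JSON batches that stay under a byte limit.

    max_batch_bytes is an assumed limit. A single operation larger than the
    limit still gets a batch of its own and must be handled by the caller.
    """
    batches, current, current_size = [], [], 2  # 2 bytes for "[" and "]"
    for op in operations:
        encoded = json.dumps(op, ensure_ascii=False).encode("utf-8")
        needed = len(encoded) + (1 if current else 0)  # comma separator
        if current and current_size + needed > max_batch_bytes:
            batches.append(b"[" + b",".join(current) + b"]")
            current, current_size = [], 2
            needed = len(encoded)
        current.append(encoded)
        current_size += needed
    if current:
        batches.append(b"[" + b",".join(current) + b"]")
    return batches


# Small demonstration with an artificially low limit.
ops = [
    {"type": "add", "id": f"doc{i}", "fields": {"content": "x" * 100}}
    for i in range(10)
]
batches = split_into_batches(ops, max_batch_bytes=400)
print(len(batches), [len(b) for b in batches])
```

Each batch could then be sent with an SDK call such as `upload_documents` on boto3's `cloudsearchdomain` client. That step is left out here because it needs a live domain endpoint.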

Answers


Same problem here. I am working on a document management project (C#, WPF) and want to index a large number of PDFs from S3 in CloudSearch.

The following process met my requirements. I was not able to find any other solution.

  • Manually configure the index
    • Example fields: 'filename', 'text', 'path', 'modifieddate'
  • Code to add documents to CloudSearch

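    // Requires: using System; using System.Collections.Generic;
    // listAllFilesOnCloud, ExtractTextFromPdf, ltDocumentTypes, IndexDocument and
    // UploadDcoumentOnCloudSearch are defined elsewhere in the project (not shown).

    // Find all files in the root folder and index each one.
    List<string> lstFiles = listAllFilesOnCloud("[BucketName]");

    foreach (string strFile in lstFiles)
    {
        string FileName = System.IO.Path.GetFileNameWithoutExtension(strFile);
        string Text = ExtractTextFromPdf("https://s3.amazonaws.com/" + strFile);
        string Path = strFile;
        DateTime ModifiedDate = DateTime.Now;

        // Not declared in the original snippet; assumed to be per-file strings.
        // Resetting them here keeps a match from one file out of the next.
        string DocumentType = null;
        string Vault = null;

        // Classify the document by looking for known markers in its first 150 characters.
        string headerText = Text.Substring(0, Math.Min(Text.Length, 150));
        foreach (var docs in ltDocumentTypes)
        {
            if (headerText.ToUpper().Contains(docs.searchText.ToUpper()))
            {
                DocumentType = docs.DocumentType;
                Vault = docs.VaultName;
            }
        }

        if (string.IsNullOrEmpty(DocumentType))
        {
            DocumentType = "Default";
            Vault = "Default";
        }

        // Populate the fields configured in the index and upload the document.
        IndexDocument docDetail = new IndexDocument();
        docDetail.filename = FileName;
        docDetail.text = Text;
        docDetail.path = Path;
        docDetail.modifieddate = ModifiedDate;

        UploadDcoumentOnCloudSearch(docDetail);
    }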
    

I used iTextSharp to extract the text from the PDFs.


Finally, I was able to get it to work! What worked for me was the AWS CloudSearch cs-import-documents command:

cs-import-documents --source "c:\test.pdf" --output "C:\test.sdf"

It produced a JSON file. I uploaded it to CloudSearch through the console, and searches returned results.

Good luck, Raj
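For more than a handful of files, the same command could be generated per PDF instead of typed by hand. This is a small sketch under assumptions: the directory names are hypothetical, only the `--source` and `--output` flags shown above are used, and the commands are printed rather than executed so the script runs without the CLI installed.

```python
from pathlib import Path


def build_import_commands(pdf_dir, out_dir):
    """Return one cs-import-documents command (as an argv list) per PDF."""
    commands = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        out_file = Path(out_dir) / (pdf.stem + ".json")
        commands.append([
            "cs-import-documents",
            "--source", str(pdf),
            "--output", str(out_file),
        ])
    return commands


if __name__ == "__main__":
    # Dry run: print the commands. To execute them, pass each list to
    # subprocess.run(cmd, check=True) on a machine with the CLI installed.
    for cmd in build_import_commands("pdfs", "batches"):
        print(" ".join(cmd))
```

Passing the arguments as a list avoids quoting problems with paths that contain spaces.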

