Channel: Active questions tagged blazor - Stack Overflow

How can I properly extract text from a PDF file to store it in an Elasticsearch index?


I'm working on a C#/Blazor project that extracts text from PDF files and stores it in an Elasticsearch index for full-text search.

The issue: when the PDF was originally a PowerPoint presentation, the layout and paragraph structure are not preserved. For example, bullet points and headings are extracted line-by-line without any clear structure.

Is there a way to preserve the logical layout when extracting text from these PDFs?

I'm using PdfTextExtractor.GetTextFromPage() from iText.Kernel.Pdf.Canvas.Parser, but it just reads every line individually. This makes the text hard to search and read once indexed.

    try
    {
        using var reader = new PdfReader(file.Path);
        using var pdf = new PdfDocument(reader);

        for (var i = 1; i <= pdf.GetNumberOfPages(); i++)
        {
            var pageText = PdfTextExtractor.GetTextFromPage(pdf.GetPage(i));

            await elasticService.WriteAsync(new ElasticWriteModel
            {
                DocId = document.Id,
                Page = i,
                Text = pageText,
                AccessLevel = document.AccessLevel == AccessLevel.Private ? 0 : 1,
                Timestamp = DateTime.UtcNow
            }, cancellationToken);
        }
    }
    catch (Exception ex)
    {
        logger.LogError(ex, "Failed to process document with ID {DocumentId}", document.Id);
    }

This produces a strange result, as shown in the linked image.

I've also tried the File.ReadAllLines() method from System.IO, but that didn't work either.
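
One possible workaround for the line-by-line output described above is to regroup the extracted lines before indexing. The sketch below is only a crude heuristic under stated assumptions: `SlideTextNormalizer` and `MergeLines` are hypothetical names (not part of iText or this project), the bullet glyphs in `BulletMarkers` are guesses about what the slides contain, and a non-bullet line that directly follows a bullet is assumed to be a wrapped continuation of that bullet. A heading that immediately follows a bullet list without a blank line would therefore be merged into the last bullet.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper; not part of iText or of the project shown above.
public static class SlideTextNormalizer
{
    // Assumed bullet glyphs; adjust to whatever the PDFs actually contain.
    private const string BulletMarkers = "•▪*-";

    // Groups raw extracted lines into logical blocks:
    //  - a line starting with a bullet marker opens a new block,
    //  - a non-bullet line right after a bullet is appended to that bullet,
    //  - any other line becomes its own block (e.g. a heading),
    //  - a blank line closes the current block.
    public static List<string> MergeLines(string pageText)
    {
        var blocks = new List<string>();
        var current = new StringBuilder();
        var inBullet = false;

        void Flush()
        {
            if (current.Length > 0) blocks.Add(current.ToString());
            current.Clear();
            inBullet = false;
        }

        foreach (var raw in pageText.Split('\n'))
        {
            var line = raw.Trim(); // also strips a trailing '\r'
            if (line.Length == 0) { Flush(); continue; }

            if (BulletMarkers.IndexOf(line[0]) >= 0)
            {
                Flush();
                current.Append(line.Substring(1).TrimStart());
                inBullet = true;
            }
            else if (inBullet)
            {
                // Treat as the wrapped remainder of the current bullet.
                if (current.Length > 0) current.Append(' ');
                current.Append(line);
            }
            else
            {
                Flush();
                current.Append(line);
            }
        }

        Flush();
        return blocks;
    }
}
```

In the loop above, the indexed value could then be something like `string.Join("\n", SlideTextNormalizer.MergeLines(pageText))` instead of the raw `pageText`, so that each bullet is stored as one line. Whether this helps depends on how consistently the PDFs mark bullets.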

